I'm using ES 7.1.1 and Spark 2.4.2. The ES cluster is on Google Kubernetes Engine and the Spark cluster is on Google Dataproc.
Big jobs are failing with the following error, often several hours into the job:
19/06/25 08:46:15 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in stage 2.0 (TID 556, cluster.name, executor 3): org.apache.spark.util.TaskCompletionListenerException: org.elasticsearch.hadoop.rest.EsHadoopRemoteException: circuit_breaking_exception: [parent] Data too large, data for [<http_request>] would be [5127135952/4.7gb], which is larger than the limit of [5067151769/4.7gb], real usage: [5127135952/4.7gb], new bytes reserved: [0/0b]
It then prints the batch request, which is very large.
Any ideas on how to prevent this kind of error? It looks to me like ES is not keeping up with the rate of requests, so memory usage is increasing until requests are rejected.
In this case, it would be nice if Spark slowed down. It looks like retries are enabled, but the back-off time doesn't appear to increase.
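For reference, these are the es-hadoop write settings I understand to control bulk sizing and retry back-off. A rough sketch of how I'd expect to set them from Spark (hosts, index name, and the values themselves are placeholders, not what our job actually uses):

import org.apache.spark.sql.DataFrame

// Sketch only: hosts, index name, and values are placeholders, and df stands in
// for whatever DataFrame the job writes.
def saveWithBackoff(df: DataFrame): Unit = {
  df.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "my-es-host")            // ES endpoint (placeholder)
    .option("es.batch.size.entries", "500")      // documents per bulk request (default 1000)
    .option("es.batch.size.bytes", "1mb")        // max bytes per bulk request (default 1mb)
    .option("es.batch.write.retry.count", "6")   // retries after a rejected bulk request (default 3)
    .option("es.batch.write.retry.wait", "60s")  // wait between retries (default 10s)
    .mode("append")
    .save("my-index")                            // placeholder index name
}

As far as I can tell, that retry wait is a fixed interval rather than an increasing back-off.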
This sort of exception can be caused by a number of things. Taking a look at the message:
circuit_breaking_exception: [parent] Data too large, data for [<http_request>] would be [5127135952/4.7gb], which is larger than the limit of [5067151769/4.7gb], real usage: [5127135952/4.7gb], new bytes reserved: [0/0b]
So in this case the "parent" breaker was tripped. The parent breaker is the sum of all the other breakers, so the first thing to do is check the node stats API with:
GET /_nodes/stats/breaker?human&pretty
This returns the breaker statistics for each node; you can then see whether any of the other breakers are contributing to the usage that caused the parent breaker to trip.
Next, since this is 7.1, the real memory circuit breaker is in play: it samples the actual heap usage of Elasticsearch to try to prevent an OutOfMemoryError. If the individual breakers don't tell you where the memory is being used, it's worth checking how large the requests you are sending to ES are.
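One rough way to gauge that from the Spark side is to serialize a sample of the documents and extrapolate. This is only a sketch, assuming the DataFrame being written is called df and that the es-hadoop default of 1000 documents per bulk request is in effect:

import java.nio.charset.StandardCharsets

// Serialize a sample of documents as JSON and estimate the average size per document.
val sample = df.limit(1000).toJSON.collect()
val sampleBytes = sample.map(_.getBytes(StandardCharsets.UTF_8).length.toLong).sum
val avgDocBytes = if (sample.nonEmpty) sampleBytes / sample.length else 0L
println(s"~$avgDocBytes bytes per doc, roughly ${avgDocBytes * 1000 / (1024 * 1024)} MB per 1000-document bulk request")

The real bulk payload will be somewhat larger, since each document is preceded by an action/metadata line, but it gives a sense of whether a single bulk request is in the megabyte or the hundreds-of-megabytes range.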
We figured out the issue: the analyzers we were using unnecessarily included a "completion" analyzer that uses a lot of JVM memory for large indices. Removing this analyzer resolved the problem.