I'm using ES 7.1.1 and Spark 2.4.2. The ES cluster is on Google Kubernetes Engine and the Spark cluster is on Google Dataproc.
Big jobs are failing with the following error, often several hours into the job:
19/06/25 08:46:15 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 5.0 in stage 2.0 (TID 556, cluster.name, executor 3): org.apache.spark.util.TaskCompletionListenerException: org.elasticsearch.hadoop.rest.EsHadoopRemoteException: circuit_breaking_exception: [parent] Data too large, data for [<http_request>] would be [5127135952/4.7gb], which is larger than the limit of [5067151769/4.7gb], real usage: [5127135952/4.7gb], new bytes reserved: [0/0b]
It then prints the batch request, which is very large.
Any ideas on how to prevent this kind of error? It looks to me like ES isn't keeping up with the rate of write requests, so heap usage on the ES nodes keeps climbing until the parent circuit breaker starts rejecting them.
In that case, it would be nice if Spark slowed down. Retries appear to be enabled in the connector, but the back-off time doesn't seem to increase between attempts.
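For what it's worth, the settings I've found that look most relevant are the connector's batch-size and retry options. Here's a minimal sketch of how I'd expect to tune them; the host, index name, and values are placeholders, not my real config:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Sketch only: "df", "my-es-host" and "my-index" are placeholders.
    def writeToEs(df: DataFrame): Unit = {
      df.write
        .format("org.elasticsearch.spark.sql")
        .option("es.nodes", "my-es-host")            // placeholder host
        // Smaller bulk batches so each request reserves less memory on the ES side.
        .option("es.batch.size.entries", "500")      // default 1000 docs per bulk request
        .option("es.batch.size.bytes", "1mb")        // default 1mb
        // Retry behaviour on bulk rejections; the wait looks fixed rather than exponential.
        .option("es.batch.write.retry.count", "6")   // default 3
        .option("es.batch.write.retry.wait", "60s")  // default 10s
        .mode("append")
        .save("my-index")                            // placeholder index name
    }

Is lowering the batch size and raising the retry wait the right lever here, or is this better handled on the ES side?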
Any tips on resolving this problem?