Hi!
We were recently moved from a large cluster (8 nodes, 21 GB JVM heap) to a smaller one (3 nodes, 9 GB JVM heap), and now we are getting errors while trying to re-upload our Elasticsearch indices. The job occasionally passes, but most runs fail with a circuit breaking exception. We didn't have this issue on the larger cluster.
circuit_breaking_exception: [parent] Data too large, data for [<http_request>] would be [4080680148/3.8gb], which is larger than the limit of [4080218931/3.7gb], real usage: [4080679872/3.8gb], new bytes reserved: [276/276b], usages [inflight_requests=630/630b, request=0/0b, fielddata=3493/3.4kb, eql_sequence=0/0b, model_inference=0/0b]
As we understand it, the circuit breaker feature is there to prevent an out-of-memory error, so it is good that it is doing its job. However, I don't understand how we're supposed to keep our requests within the memory limits.
The official documentation mentions decreasing the bulk size, but changing those parameters to half of their default values does not seem to have any impact.
This is our code, using the Python DataFrame API. The documentation mostly covers the legacy RDD API, but other sources suggest that the DataFrame writer also uses the bulk API under the hood.
(data.write.format("org.elasticsearch.spark.sql")
    .options(**get_es_conf())
    .mode("overwrite")
    .save(index))
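For reference, get_es_conf() returns roughly the following (hosts and credentials are placeholders here; in the real job they come from our secrets), with the bulk settings halved from the connector defaults:

def get_es_conf():
    # Placeholder connection details, just to illustrate the shape of the options dict.
    return {
        "es.nodes": "https://our-es-cluster.example.com",
        "es.port": "9200",
        "es.net.http.auth.user": "elastic",
        "es.net.http.auth.pass": "<password>",
        # Bulk request sizing, halved from the defaults
        # (es.batch.size.bytes defaults to 1mb, es.batch.size.entries to 1000).
        "es.batch.size.bytes": "512kb",
        "es.batch.size.entries": "500",
    }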
The Databricks job cluster:
Driver: Standard_DS3_v2 · Workers: Standard_DS3_v2 · 2 workers · 12.2 LTS (includes Apache Spark 3.3.2, Scala 2.12)
The workflow task depends on the driver library org.elasticsearch:elasticsearch-spark-30_2.12:8.6.2.
The Elasticsearch cluster is running Elasticsearch version 8.6.2.
How can we configure this so that it successfully uploads the new data without triggering the circuit breaker?