How to avoid Data too large error for org.apache.spark.sql.streaming.StreamingQueryException

I am using Spark Structured Streaming to dump data from Kafka to Elasticsearch, and I got the following error:

org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to stage failure: Task 6 in stage 4888.0 failed 4 times, most recent failure: Lost task 6.3 in stage 4888.0 (TID 58565,, executor 256): org.apache.spark.util.TaskCompletionListenerException: circuit_breaking_exception: [parent] Data too large, data for [<http_request>] would be [6104859024/5.6gb], which is larger than the limit of [6103767449/5.6gb], real usage: [6104859024/5.6gb], new bytes reserved: [0/0b]

Could someone suggest how I can adjust some parameters to avoid this?

Here are my ES configuration options for the write:
val esURL = "xxxx"
.option("", "xxx")
.option("", "xxx")
.option("checkpointLocation", "/mnt/xxxx/_checkpoint1")
.option("", "true")
.option("", "true")
.option("es.nodes", esURL)
.option("es.resource.write", "service-log-{date}")
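For what it's worth, the `circuit_breaking_exception: [parent] Data too large` message means an ES node's parent circuit breaker tripped because the node's heap usage hit its limit (the `5.6gb` limit suggests roughly a 6 GB heap at the default 95% threshold). One lever on the Spark side is to shrink the bulk requests the ES-Hadoop connector sends, via its `es.batch.size.bytes` / `es.batch.size.entries` options, and to let it retry when ES pushes back. A minimal sketch, assuming `df` is your streaming DataFrame and the specific values are starting points to tune, not verified recommendations for this cluster:

```scala
// Sketch: reduce per-bulk-request memory pressure on the ES nodes.
// `df` and `esURL` are placeholders from the original post.
val query = df.writeStream
  .format("es")
  .option("checkpointLocation", "/mnt/xxxx/_checkpoint1")
  .option("es.nodes", esURL)
  .option("es.resource.write", "service-log-{date}")
  .option("es.batch.size.bytes", "500kb")     // default 1mb; smaller bulks mean less heap per request
  .option("es.batch.size.entries", "500")     // default 1000; also cap the document count per bulk
  .option("es.batch.write.retry.count", "6")  // default 3; retry rather than fail the task under pressure
  .option("es.batch.write.retry.wait", "30s") // default 10s; give the breaker time to recover
  .start()
```

If the breaker still trips with small bulks, the remaining levers are on the ES side (a larger node heap) or reducing the number of Spark tasks writing concurrently.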

I checked my cluster health, and it looks fine:

"cluster_name" : "logs001",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 8,
"number_of_data_nodes" : 3,
"active_primary_shards" : 14,
"active_shards" : 29,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0

Also, if I rerun the Spark streaming job, it starts working again.

@james.baiera do you have some insights on this? Thanks
