How to avoid Data too large error for org.apache.spark.sql.streaming.StreamingQueryException

I am using Spark Structured Streaming to dump data from Kafka into Elasticsearch, and I am getting the following error:

org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to stage failure: Task 6 in stage 4888.0 failed 4 times, most recent failure: Lost task 6.3 in stage 4888.0 (TID 58565, 10.139.64.27, executor 256): org.apache.spark.util.TaskCompletionListenerException: org.elasticsearch.hadoop.rest.EsHadoopRemoteException: circuit_breaking_exception: [parent] Data too large, data for [<http_request>] would be [6104859024/5.6gb], which is larger than the limit of [6103767449/5.6gb], real usage: [6104859024/5.6gb], new bytes reserved: [0/0b]


Could someone suggest which parameters I can adjust to avoid this?

Here is my ES configuration for writing to Elasticsearch:
val esURL = "xxxx"
serviceLogDfForES.writeStream
.outputMode("append")
.format("org.elasticsearch.spark.sql")
.option("es.nodes.wan.only","true")
.option("es.port","9200")
.option("es.net.http.auth.user", "xxx")
.option("es.net.http.auth.pass", "xxx")
.option("checkpointLocation", "/mnt/xxxx/_checkpoint1")
.option("es.net.ssl","true")
.option("es.net.ssl.cert.allow.self.signed", "true")
.option("es.mapping.date.rich", "true")
.option("es.nodes", esURL)
.option("es.resource.write", "service-log-{date}")
.start().awaitTermination()
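
From what I understand, the circuit_breaking_exception means the parent breaker on an ES node rejected the HTTP request because that node's heap was already close to its limit, so one idea I am considering is making each bulk request smaller and letting the connector back off and retry. This is only a sketch of what I might try, not something I have tested; the es.batch.* values below are guesses on my part:

serviceLogDfForES.writeStream
  .outputMode("append")
  .format("org.elasticsearch.spark.sql")
  // ... same connection / auth / SSL options as above ...
  .option("es.nodes", esURL)
  .option("es.resource.write", "service-log-{date}")
  .option("checkpointLocation", "/mnt/xxxx/_checkpoint1")
  // keep each bulk request small so a single flush stays well under the breaker limit
  .option("es.batch.size.bytes", "1mb")        // guessed value (connector default is 1mb per task)
  .option("es.batch.size.entries", "500")      // guessed value (connector default is 1000 docs per task)
  // back off and retry instead of failing the task when ES rejects a bulk request
  .option("es.batch.write.retry.count", "6")   // guessed value (default 3)
  .option("es.batch.write.retry.wait", "60s")  // guessed value (default 10s)
  .start().awaitTermination()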

I checked my cluster health, and it looks fine:

{
  "cluster_name" : "logs001",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 8,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 14,
  "active_shards" : 29,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

Also, if I rerun the Spark streaming job, it starts working again.
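
Since it recovers after a restart, I am wondering whether one oversized micro-batch (for example, the backlog that accumulates while the query is down) is what pushes a node over the limit. If so, capping how much each trigger reads from Kafka might help as well. Again, just a sketch; maxOffsetsPerTrigger is a standard option of the Kafka source, but the value and the bootstrap servers / topic below are placeholders, not my real settings:

// read-side cap: limit how many Kafka records each micro-batch pulls,
// so the resulting bulk writes to ES stay small
val serviceLogDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "xxxx:9092")  // placeholder
  .option("subscribe", "service-log")              // placeholder topic name
  .option("maxOffsetsPerTrigger", "50000")         // guessed cap on records per trigger
  .load()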

@james.baiera do you have any insights on this? Thanks
