Spark 2.4 to Elasticsearch: prevent data loss during Dataproc node decommissioning?

fredrouvier · January 22, 2020, 2:04pm

My task is to synchronize data from GCS (Google Cloud Storage) to our Elasticsearch cluster. We use Apache Spark 2.4 with the Elasticsearch-Hadoop (ES-Hadoop) connector on a Google Dataproc cluster with autoscaling enabled.

During execution, if the Dataproc cluster scales down, all tasks on the decommissioned node are lost, and the data processed on that node is never pushed to Elasticsearch.

This problem does not occur when I save to GCS or HDFS, for example.

How can I make this job resilient when nodes are decommissioned?

An extract of the stack trace:

Lost task 50.0 in stage 2.3 (TID 427, xxxxxxx-sw-vrb7.c.xxxxxxx, executor 43): FetchFailed(BlockManagerId(30, xxxxxxx-w-23.c.xxxxxxx, 7337, None), shuffleId=0, mapId=26, reduceId=170, message=org.apache.spark.shuffle.FetchFailedException: Failed to connect to xxxxxxx-w-23.c.xxxxxxx:7337

Caused by: java.net.UnknownHostException: xxxxxxx-w-23.c.xxxxxxx

Task 50.0 in stage 2.3 (TID 427) failed, but the task will not be re-executed (either because the task failed with a shuffle data fetch failure, so the previous stage needs to be re-run, or because a different copy of the task has already succeeded).

Thanks. Fred

james.baiera · January 28, 2020, 9:51pm

This looks more like a Spark issue than an ES-Hadoop one. If the workers executing the Spark job's tasks disappear, it is up to the Spark executor framework to reschedule them appropriately. Do you have any errors or exception stack traces that reference ES-Hadoop?
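
As a rough illustration of what "leaving it to Spark" could look like, here is a minimal sketch, not a confirmed fix for this thread. It assumes that each record carries a stable unique key (the field name `doc_id` is hypothetical) and that giving Spark more stage attempts is acceptable for the job. `spark.task.maxFailures`, `spark.stage.maxConsecutiveAttempts`, and `spark.speculation` are standard Spark properties; `es.nodes`, `es.mapping.id`, and `es.write.operation` are standard ES-Hadoop settings. The numeric values, host name, and index name are placeholders. The helper functions are illustrative and not part of either library.

```python
def spark_retry_conf(max_task_failures=8, max_stage_attempts=8):
    """Spark properties that leave more room to recover from lost nodes.

    Fetch failures (as in the trace above) cause the stage to be resubmitted,
    which is bounded by spark.stage.maxConsecutiveAttempts rather than by
    spark.task.maxFailures. The values used here are arbitrary assumptions.
    """
    return {
        "spark.task.maxFailures": str(max_task_failures),
        "spark.stage.maxConsecutiveAttempts": str(max_stage_attempts),
        # Speculative duplicates of a task would send the same documents twice.
        "spark.speculation": "false",
    }


def es_write_options(nodes, id_field):
    """ES-Hadoop options that make a re-executed task overwrite, not duplicate.

    With es.mapping.id set, the document _id comes from the record itself, so
    a task that runs again after a node loss re-indexes the same documents.
    """
    return {
        "es.nodes": nodes,
        "es.mapping.id": id_field,
        "es.write.operation": "index",
    }


# Hypothetical usage with PySpark (requires pyspark and the ES-Hadoop jar):
#   conf = SparkConf().setAll(list(spark_retry_conf().items()))
#   df.write.format("org.elasticsearch.spark.sql") \
#       .options(**es_write_options("es-host:9200", "doc_id")) \
#       .mode("append").save("my-index")
```

The idea behind the design is that retries only help if they are harmless: pinning the document ID to a field of the record means a re-run task replaces documents instead of adding new ones. Whether more stage attempts are enough depends on how quickly Dataproc removes nodes, which this sketch does not address.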

system · February 25, 2020, 9:59pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
