Spark 2.4 to Elasticsearch: how to prevent data loss during Dataproc node decommissioning?

My technical task is to synchronize data from GCS (Google Cloud Storage) to our Elasticsearch cluster. We use Apache Spark 2.4 with the Elasticsearch Hadoop (ES-Hadoop) connector on a Google Dataproc cluster with autoscaling enabled.
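For context, the job looks roughly like the minimal Scala sketch below: read JSON from GCS, transform it, and write it to Elasticsearch through the ES-Hadoop connector. The bucket path, index name, Elasticsearch host, and the dedup column are placeholders, not the real values from my job.

import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._   // adds saveToEs to DataFrame

object GcsToEs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("gcs-to-elasticsearch")
      // Connection settings read by the ES-Hadoop connector (placeholder host)
      .config("es.nodes", "my-es-host")
      .config("es.port", "9200")
      .config("es.nodes.wan.only", "true")
      .getOrCreate()

    // The Dataproc GCS connector resolves gs:// paths natively
    val df = spark.read.json("gs://my-bucket/input/")

    // Any wide transformation introduces a shuffle, which is where the
    // decommissioning failure below shows up
    val deduped = df.dropDuplicates("id")

    // Push the result to Elasticsearch (placeholder index resource)
    deduped.saveToEs("my-index/_doc")
  }
}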

During execution, if the Dataproc cluster is downscaled, all tasks on the decommissioned node are lost, and the data processed on that node is never pushed to Elasticsearch.

This problem does not occur when I save to GCS or HDFS, for example.
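For comparison, this is the write step that works fine, continuing from the sketch above with a placeholder output path:

// Writing the same result to GCS instead of Elasticsearch does not show
// the data loss when the cluster downscales during the job.
deduped.write
  .mode("overwrite")
  .parquet("gs://my-bucket/output/")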

How can I make this job resilient even when nodes are decommissioned?

An extract of the stack trace:

Lost task 50.0 in stage 2.3 (TID 427, xxxxxxx-sw-vrb7.c.xxxxxxx, executor 43): FetchFailed(BlockManagerId(30, xxxxxxx-w-23.c.xxxxxxx, 7337, None), shuffleId=0, mapId=26, reduceId=170, message=org.apache.spark.shuffle.FetchFailedException: Failed to connect to xxxxxxx-w-23.c.xxxxxxx:7337

Caused by: java.net.UnknownHostException: xxxxxxx-w-23.c.xxxxxxx

Task 50.0 in stage 2.3 (TID 427) failed, but the task will not be re-executed (either because the task failed with a shuffle data fetch failure, so the previous stage needs to be re-run, or because a different copy of the task has already succeeded).

Thanks. Fred

This looks more like a Spark issue than an ES-Hadoop one. If the workers executing the Spark job's tasks disappear, it is up to the Spark execution framework to reschedule them appropriately. Do you have any errors or exception stack traces that reference ES-Hadoop?
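For what it's worth, the rescheduling behaviour mentioned above is governed by a handful of Spark settings. The sketch below only shows where they would be set; the values are illustrative, not recommendations, and since a FetchFailedException triggers a stage retry rather than a plain task retry, spark.stage.maxConsecutiveAttempts is likely the more relevant knob here.

import org.apache.spark.sql.SparkSession

// Spark 2.4 settings that bound how many times tasks and stages are
// retried after failures such as FetchFailedException (defaults are 4).
val spark = SparkSession.builder()
  .appName("gcs-to-elasticsearch")
  .config("spark.task.maxFailures", "8")               // per-task retry limit
  .config("spark.stage.maxConsecutiveAttempts", "8")   // stage retries after fetch failures
  .getOrCreate()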
