Hi Costin,
Just wanted to check on the feasibility of using es-hadoop (Spark) with Elasticsearch to update roughly 80 million documents spread across 32 shards. With the basic default configuration described in the documentation, a six-executor Spark cluster attempting to update that many documents eventually ran into the following error (a rough sketch of the job is included after the trace):
Lost task 20.1 in stage 1.0 (TID 32, ip-10-0-2-240.ec2.internal): org.apache.spark.util.TaskCompletionListenerException: SearchPhaseExecutionException[Failed to execute phase [init_scan], all shards failed]
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:83)
at org.apache.spark.scheduler.Task.run(Task.scala:72)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
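For reference, the job is essentially the minimal sketch below. The index name, id field, and updated field are placeholders, and I am assuming the es-hadoop defaults for batch and scroll settings since nothing else was overridden:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

// Hypothetical index/field names used purely for illustration.
object BulkUpdateJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("es-bulk-update")
      .set("es.nodes", "10.0.2.240")           // placeholder ES node
      .set("es.write.operation", "update")     // update existing documents instead of indexing new ones
      .set("es.mapping.id", "doc_id")          // field carrying the document _id
      // batch sizes, scroll size, and retry counts left at the es-hadoop defaults

    val sc = new SparkContext(conf)

    // Scan the index, build partial documents with the fields to update,
    // and write them back with saveToEs.
    val updates = sc.esRDD("myindex/mytype")   // (id, Map[String, AnyRef]) pairs
      .map { case (id, _) =>
        Map("doc_id" -> id, "status" -> "processed")  // partial update payload
      }

    updates.saveToEs("myindex/mytype")
    sc.stop()
  }
}
```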
Also, the Spark job started out at a throughput of about 300,000 (3 lakh) records per minute, which after roughly 4 hours dropped to about 200,000 (2 lakh) records per minute before the job died.
First question: is it actually recommended to update such a large number of documents in Elasticsearch using the es-hadoop connector? If yes, what am I missing and what is the best way to do it? If not, what does Elasticsearch recommend instead?