I build a completely new index and write it to Elasticsearch using the elasticsearch-hadoop (Spark) library from Elastic.
I need a fairly large Spark cluster for the calculations that build the index, and the last step is to write the result to Elasticsearch. The problem is that the Spark cluster seems to overload the ES cluster.
- How should I handle this situation, a large compute cluster writing to a smaller ES cluster?
- Can I throttle the writes in some way using the ES-Hadoop/Spark library (see the first sketch after this list)?
- I was told elsewhere that I could change how many documents are sent in each bulk request and that this would speed up the write. Is this true, and which parameter controls it?
- Is there any other way to make this more efficient, such as pausing indexing/refresh until the write is done? Since I write to a new index and then swap an alias, I can refresh before the swap happens (see the second sketch below).
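For reference, here is roughly what my write step looks like, with the knobs I believe are relevant. The index name, type, input path, and partition count are made up for this example; the `es.batch.*` settings are from the elasticsearch-hadoop configuration docs:

```scala
// Sketch of the write step with throttling knobs. Names and values are placeholders.
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._

val spark = SparkSession.builder().appName("build-es-index").getOrCreate()
val df = spark.read.parquet("/path/to/computed/docs") // placeholder input

df
  .coalesce(8) // fewer partitions = fewer concurrent bulk writers hitting ES
  .saveToEs("my_index_v2/doc", Map(
    "es.batch.size.entries"      -> "500", // docs per bulk request (default 1000)
    "es.batch.size.bytes"        -> "1mb", // bytes per bulk request (default 1mb)
    "es.batch.write.retry.count" -> "6",   // default 3, which matches the retries I see
    "es.batch.write.retry.wait"  -> "60s"  // default 10s; back off longer on rejections
  ))
```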
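And for the refresh question, what I had in mind is toggling the index refresh interval (and replicas) around the bulk load via the index-settings REST endpoint. A minimal sketch; the host, index name, and restored values are placeholders for my setup:

```scala
// Toggle refresh_interval and number_of_replicas around the bulk load.
// Host and index name are placeholders.
import java.net.{HttpURLConnection, URL}

def putIndexSettings(body: String): Unit = {
  val conn = new URL("http://es-host:9200/my_index_v2/_settings")
    .openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("PUT")
  conn.setRequestProperty("Content-Type", "application/json")
  conn.setDoOutput(true)
  conn.getOutputStream.write(body.getBytes("UTF-8"))
  println(s"PUT _settings -> ${conn.getResponseCode}")
  conn.disconnect()
}

// Before the Spark write: stop periodic refreshes and skip replication.
putIndexSettings("""{"index": {"refresh_interval": "-1", "number_of_replicas": 0}}""")

// ... run the Spark job ...

// After the write, before swapping the alias: restore settings, so one
// refresh makes everything searchable.
putIndexSettings("""{"index": {"refresh_interval": "1s", "number_of_replicas": 1}}""")
```

The connector also has `es.batch.write.refresh` (default true), which triggers a refresh after the bulk write completes; I assume it could be set to false if the refresh is handled manually like this.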
Here is an example of the errors I get. The three retries usually let the job succeed eventually, but so many tasks fail that a lot of time is wasted on retries.
2017-03-20 10:48:27,745 WARN org.apache.spark.scheduler.TaskSetManager [task-result-getter-2] - Lost task 568.1 in stage 81.0 (TID 18982, ip-172-16-2-76.ec2.internal): org.apache.spark.util.TaskCompletionListenerException: Could not write all entries [41/87360] (maybe ES was overloaded?). Bailing out...