Upsert from Spark - 504 after 1.5 hours

We have an index of 250M documents and are trying to update 55M of them from Spark to Elasticsearch.
The initial write of 250M on its own runs without issue. Unfortunately, the 55M updates hit a few timeouts after roughly 1.5 hours of running.
The update is performed with es.write.operation => update, and we are specifying es.mapping.id.
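For reference, a minimal sketch of how we issue the updates from Spark SQL (index name, type name, and the `id` column are placeholders, not our real values):

```scala
// Sketch of the update job: assumes a DataFrame `updatesDF` whose `id` column
// matches the documents' _id values; "my-index/my-type" is a placeholder resource.
import org.elasticsearch.spark.sql._

updatesDF.saveToEs(
  "my-index/my-type",
  Map(
    "es.write.operation" -> "update",  // send bulk update actions instead of index
    "es.mapping.id"      -> "id"       // column used as the document _id
  )
)
```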

17/08/16 02:20:55 ERROR Executor: Exception in task 53.0 in stage 3.0 (TID 57)
org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [PUT] on [{index}/{mapping}/_bulk] failed; server[{removed}:443] returned [504|null:]
at org.elasticsearch.hadoop.rest.RestClient.checkResponse(RestClient.java:505)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:463)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:445)
at org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:186)
at org.elasticsearch.hadoop.rest.RestRepository.tryFlush(RestRepository.java:220)
at org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:242)
at org.elasticsearch.hadoop.rest.RestRepository.doWriteToIndex(RestRepository.java:196)
at org.elasticsearch.hadoop.rest.RestRepository.writeToIndex(RestRepository.java:159)
at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:67)
at org.elasticsearch.spark.sql.EsSparkSQL$$anonfun$saveToEs$1.apply(EsSparkSQL.scala:97)
at org.elasticsearch.spark.sql.EsSparkSQL$$anonfun$saveToEs$1.apply(EsSparkSQL.scala:97)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Our Elasticsearch cluster is running on AWS, so we can't increase the cluster's timeout.
Is there another approach we could take to get our updates into Elasticsearch from Spark?

Unfortunately, there's not much that can be done other than tuning the performance of the cluster on AWS. You could try lowering the number of documents per bulk operation with es.batch.size.entries (it defaults to 1000), but that may have a negative impact on your update throughput. A sketch of that change is below.
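Assuming the same saveToEs call shown above, the setting can be passed alongside the other write options; the value of 500 here is purely illustrative:

```scala
// Hypothetical sketch: shrink each bulk request so individual requests finish
// within the gateway timeout. 500 is an example value, not a recommendation.
import org.elasticsearch.spark.sql._

updatesDF.saveToEs(
  "my-index/my-type",
  Map(
    "es.write.operation"    -> "update",
    "es.mapping.id"         -> "id",
    "es.batch.size.entries" -> "500"   // default is 1000 documents per bulk request
  )
)
```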

Thanks James, we'll give it a shot.
