saveToEs Write performance (elasticsearch-spark)



I have a question about parallelism when saving an RDD to Elasticsearch.
I have an RDD (created with SparkSQL) with 1000 partitions, and an Elasticsearch index with 5 primary shards. I run my application on a Spark cluster with 3 executors.
However, when I call saveToEs I see only one task, running on a single executor, although I would expect the write to run in parallel.
What is going wrong here?

(Pat Humphreys) #2

Did you ever get to the bottom of this issue? I am seeing the same thing.


Hi @Pat_Humphreys,

see this answer:

I ended up using saveToEsWithMeta, with a configuration that sets at least es.batch.size.entries, es.batch.size.bytes, and es.batch.write.retry.count.
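
For anyone landing here later, a minimal sketch of what such a call might look like. It assumes the elasticsearch-spark connector is on the classpath and that a cluster is reachable at the address given. The node address, index name, sample documents, and batch values are all placeholders I made up for illustration (the batch values are, as far as I know, the connector defaults), not the settings I actually used. The sketch also prints the partition count, because Spark runs one write task per RDD partition, so that number is worth checking just before the save.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds saveToEs / saveToEsWithMeta to RDDs

object EsWriteSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("es-write-sketch"))

    // Hypothetical values: tune them for your own cluster and document size.
    val esCfg = Map(
      "es.nodes"                   -> "localhost:9200", // assumed address
      "es.batch.size.entries"      -> "1000", // docs per bulk request, per task
      "es.batch.size.bytes"        -> "1mb",  // bytes per bulk request, per task
      "es.batch.write.retry.count" -> "3"     // retries when a bulk is rejected
    )

    // saveToEsWithMeta expects a pair RDD: key = document id, value = document.
    val docs = sc.makeRDD(Seq(
      ("1", Map("title" -> "first")),
      ("2", Map("title" -> "second"))
    ))

    // One write task is scheduled per partition, so check this before saving.
    println(s"partitions to write: ${docs.getNumPartitions}")

    // "my-index" is a placeholder; older connector versions expect "index/type".
    docs.saveToEsWithMeta("my-index", esCfg)

    sc.stop()
  }
}
```

Note that the batch settings apply per task, so the total load on Elasticsearch grows with the number of partitions being written concurrently.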
