saveToEs Write performance (elasticsearch-spark)

larghir · July 7, 2016, 1:27pm

Hi,

I have a question about parallelism when saving an RDD to Elasticsearch.
I have an RDD (created with SparkSQL) with 1000 partitions, and an Elasticsearch index with 5 primary shards. I run my application on a Spark cluster with 3 executors.
However, I only see one task (running on one executor) when calling saveToEs, though I would expect it to write in parallel.
What is going wrong there?

Pat_Humphreys · September 15, 2016, 4:22pm

Did you ever get to the bottom of this issue? I am seeing the same thing.

larghir · September 22, 2016, 11:25am

Hi @Pat_Humphreys,

see this answer:

I ended up using saveToEsWithMeta with a configuration that sets at least es.batch.size.entries, es.batch.size.bytes, and es.batch.write.retry.count.
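
For anyone landing here later, a minimal sketch of what such a call could look like in Scala. The index name, node address, sample data, and all setting values are placeholders I made up for illustration, not the values from my job, so tune them for your own cluster:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // brings saveToEsWithMeta into scope for pair RDDs

object SaveWithMetaSketch {
  def main(args: Array[String]): Unit = {
    // Master URL is expected to come from spark-submit.
    val conf = new SparkConf().setAppName("save-to-es-sketch")
    val sc = new SparkContext(conf)

    // Hypothetical data: (document id, document body) pairs.
    val docs = sc.parallelize(Seq(
      ("1", Map("title" -> "first")),
      ("2", Map("title" -> "second"))
    ))

    // Placeholder values, not recommendations.
    val esConfig = Map(
      "es.nodes"                   -> "localhost:9200", // assumed node address
      "es.batch.size.entries"      -> "5000",           // max documents per bulk request
      "es.batch.size.bytes"        -> "5mb",            // max payload per bulk request
      "es.batch.write.retry.count" -> "5"               // retries when a bulk write is rejected
    )

    // Spark runs one write task per partition, so this is worth checking first.
    println(s"partitions before save: ${docs.getNumPartitions}")

    // The key of each pair is used as the document id.
    docs.saveToEsWithMeta("myindex/mytype", esConfig) // hypothetical index/type
    sc.stop()
  }
}
```

If the printed partition count is 1, the single task is probably coming from the RDD itself rather than from the connector.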
