Relationship between Spark tasks and batch size

@costin The Performance guide suggests that the final batch size = batch_size * # of tasks ("Thus for a job with 5 tasks, using the defaults (1mb or 1000 docs) means up to 5mb/5000 docs bulk size"). Could you please explain?

From looking at the code, it seems like EsRDDWriter.write is called for every task and creates its own instance of a RestService. Where are batches shared across tasks? Also, does creating a RestService for each task (as opposed to one per JVM) impact performance?

P.S. I only briefly looked at the code, so I may be completely off. I would really appreciate your help understanding this.

You are correct - the writing tasks are isolated from one another and do not share batch writing resources. The "final batch size" in this case is meant to give you an idea of the total impact on the target Elasticsearch cluster, not a per-task limit.

So, if you have 5 tasks and each task is writing 1mb or 1000-doc batches, then while the Spark job is running the Elasticsearch cluster may have to process multiple bulk requests at the same time, totalling up to 5mb/5000 docs (5 tasks * 1mb/1000 docs).
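To put rough numbers on it, here's a minimal sketch (Scala; it only uses the documented es.batch.size.bytes / es.batch.size.entries settings, and the 5-task figure is just the example above) of how the per-task settings translate into worst-case concurrent load on the cluster:

```scala
import org.apache.spark.SparkConf

// Per-task bulk thresholds for elasticsearch-hadoop: each Spark task flushes its own
// bulk request once either threshold is hit (the values below are the defaults).
val conf = new SparkConf()
  .setAppName("es-bulk-sizing-sketch")
  .set("es.batch.size.bytes", "1mb")     // per task, not per job
  .set("es.batch.size.entries", "1000")  // per task, not per job

// Worst case, every concurrently running write task flushes at about the same time,
// so the cluster may see (tasks in flight) x (per-task batch) of bulk data at once.
val tasksInFlight = 5
println(s"Potential concurrent bulk load: ${tasksInFlight * 1}mb / ${tasksInFlight * 1000} docs")
```

The settings themselves stay per task; only the aggregate that Elasticsearch sees scales with how many tasks are writing at once.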

Hope that helps!

A quote from Performance considerations | Elasticsearch for Apache Hadoop [8.11] | Elastic

If this takes more than 1-2s to be processed, there’s no need to decrease it. If it’s less than that, you can try increasing it in small steps.

  1. How do I see how long each POST takes to respond? In my Hadoop logs I've noticed error messages like "Maybe Elasticsearch is overloaded?" but I'm tailing the Elasticsearch logs and they remain completely empty. I have Marvel running and I see short spikes in CPU and JVM memory, but nothing alarming.

  2. The guide talks about a bulk size "with a configuration of B bytes". Can you confirm that 'B' is the value I set on my sparkConf, e.g. conf.set("es.batch.size.bytes", "15mb")? (My assumption is spelled out in the sketch after this list.)
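In case it clarifies what I mean, here's how I'm currently setting things up (a sketch only; the retry settings are the ones from the es-hadoop configuration docs, and all values are illustrative rather than recommendations):

```scala
import org.apache.spark.SparkConf

// My assumption: es.batch.size.bytes is the per-task bulk size, i.e. the "B bytes"
// the performance guide refers to.
val conf = new SparkConf()
  .setAppName("es-batch-tuning-sketch")
  .set("es.batch.size.bytes", "15mb")    // per-task bulk flush threshold in bytes
  .set("es.batch.size.entries", "1000")  // per-task flush threshold in documents
  // Retry behaviour when Elasticsearch rejects bulk documents -- presumably related
  // to the "Maybe Elasticsearch is overloaded?" messages I'm seeing.
  .set("es.batch.write.retry.count", "3")
  .set("es.batch.write.retry.wait", "10s")
```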

  1. It says you can monitor rejections in Marvel, but I don't see that metric. How do I find it?
    Answer from Jun22: Seeing Indexing Rejections

In such a scenario, monitor Elasticsearch (through Marvel or other plugins) and keep an eye on bulk processing. Look at the percentage of documents being rejected; it is perfectly fine to have some documents rejected but anything higher than 10-15% on a regular basis is a good indication the cluster is overloaded.
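If Marvel isn't surfacing that metric, a rough way to check rejections directly is the _cat thread-pool API. A sketch (assuming the cluster is reachable at localhost:9200 without authentication, and a recent Elasticsearch where the pool is named "write"; older releases named it "bulk" and exposed pool-prefixed columns such as bulk.rejected):

```scala
import scala.io.Source

// Pull the write/bulk thread-pool stats straight from the cluster; a steadily growing
// "rejected" column is a sign Elasticsearch is pushing back on the bulk load.
val stats = Source
  .fromURL("http://localhost:9200/_cat/thread_pool/write?v&h=node_name,active,queue,rejected")
  .mkString
println(stats)
```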