Relationship between Spark tasks and batch size

@costin The Performance guide suggests that the final batch size = batch_size * #of tasks ("Thus for a job with 5 tasks, using the defaults (1mb or 1000 docs) means up to 5mb/5000 docs bulk size"). Could you please explain?

From looking at the code, it seems like EsRDDWriter.write is called for every task, and creates it's own instance of a RestService. Where are batches shared across tasks? Also, does creating a RestService for each task (as opposed to 1 per JVM) impact performance?

I only briefly looked at the code, so I may be completely off. Would really appreciate your help understanding this

You are correct - the writing tasks are isolated from one another and do not share batch writing resources. The "final batch size" in this case is supposed to give you an idea of the total impact on the target Elasticsearch cluster, not per task.

So, If you have 5 tasks, and each task is writing 1mb or 1000 doc batches, then the Elasticsearch cluster will potentially have to process multiple batches at the same time that total up to 5mb/5000docs (5 tasks * 1mb/1000docs) while the Spark job is running.

Hope that helps!

A quote from

If this takes more than 1-2s to be processed, there’s no need to decrease it. If it’s less then that, you can try increasing it in small steps.

  1. How do I see how long each POST takes to respond? In my hadoop logs I've noticed error messages like "Maybe elastic search is overloaded?" but I'm tailing the elaticsearch logs and they remain completely empty. I have marvel running and I see short spikes in CPU and JVM memory but nothing alarming.

  2. Can you confirm that 'B' the configuration in bytes is my sparkConf config? ex: conf.set("es.batch.size.bytes", "15mb")

with a configuration of B bytes

  1. It says you can monitor rejections in Marvel but I don't see that metric? How do I find this?
    Answer from Jun22: Seeing Indexing Rejections

In such a scenario, monitor Elasticsearch (through Marvel or other plugins) and keep an eye on bulk processing. Look at the percentage of documents being rejected; it is perfectly fine to have some documents rejected but anything higher then 10-15% on a regular basis is a good indication the cluster is overloaded.