Spark Reading/Writing Performance

Hi,

I am interested in optimizing read and write performance in a system that uses Spark 1.6.2 and ES 2.4. I have read the existing posts & docs, but there are a few minor points I would like to clarify.

I am looking into the connector configuration. In particular, the following parameters:

  • "es.scroll.size"
  • "es.batch.size.bytes"
  • "es.batch.size.entries"

I have three questions.

  1. The meaning of the "es.scroll.size" param
    The documentation says: "the total number of documents returned is LIMIT * NUMBER_OF_SCROLLS (OR TASKS)".
    Am I right that if I have 4 shards in my ES index, then I will have 4 scrolls (tasks) with multiple requests per scroll, where each request will return "es.scroll.size" documents?
  2. Impact of the "es.batch.size.bytes" and "es.batch.size.entries" params.
    I have tried the following values: (5mb, 5,000), (10mb, 10,000), (15mb, 15,000), (25mb, 25,000). The first two pairs give me nice performance benefits. However, performance is roughly the same across the last three pairs.
    Is there some limit around 10,000 for the "es.batch.size.entries" attribute?
  3. Are there any other attributes to improve the reading/writing performance?

Thanks,
Anton

Each task opens its own scroll request against the shard data it is assigned to read. The es.scroll.size property defines how many documents are returned from each call to the scroll. If that number is larger than the number of documents in the scroll, there will be only one call; if it is half the size, there will be two calls to the same scroll, and so on.
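
As a rough sketch (index name and sizes are placeholders), the setting can also be passed per read when the RDD is created:

```scala
import org.elasticsearch.spark._

// One Spark task per shard, each with its own scroll; es.scroll.size controls
// how many documents come back per scroll call within that task.
val docs = sc.esRDD("my-index/my-type", Map("es.scroll.size" -> "1000"))

// Example: with 4 shards there are 4 tasks (scrolls). If a shard holds 10,000
// matching documents and es.scroll.size is 1000, that task makes 10 scroll calls.
```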

When we say "the total number of documents returned" in the documentation, what we mean is the total number of documents that Elasticsearch will have to serve up to worker tasks at any given time.

I'm not sure what you mean by "Is there some limit around 10,000 for the 'es.batch.size.entries' attribute". We don't impose any limit on that value if you choose to configure it higher; the default is 1000. If the larger batching values are not increasing throughput any further, you may be running up against indexing-speed limits on the Elasticsearch side. I would look at instrumenting, measuring, and improving performance on the Elasticsearch side as well.
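
For the write side, a minimal sketch of passing those batch settings per write (index name, case class, and values are purely illustrative):

```scala
import org.elasticsearch.spark._

case class Event(id: String, value: Long)
val events = sc.parallelize(Seq(Event("a", 1L), Event("b", 2L)))

// These thresholds apply per task, so the bulk load Elasticsearch sees is
// roughly (batch size) x (number of concurrently writing tasks).
events.saveToEs("my-index/my-type", Map(
  "es.batch.size.bytes"   -> "10mb",
  "es.batch.size.entries" -> "10000"))
```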

In ES-Hadoop 5.0 we have added support for sliced scrolling in Elasticsearch 5.0. Sliced scrolling is a new feature that allows a scroll to be divided into roughly equal parts, each slice returning only the documents that hash into it. This lets us subdivide scroll read requests further and allows more reader tasks to consume from Elasticsearch at a given time. The new property in ES-Hadoop 5.0 that controls this is es.input.max.docs.per.partition (more info here).
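
A minimal sketch of how that looks (index name and value are placeholders; assumes ES-Hadoop 5.x against Elasticsearch 5.x):

```scala
import org.elasticsearch.spark._

// Cap how many documents a single input partition may read; shards holding
// more than that are split into multiple slices, each read by its own task.
val docs = sc.esRDD("my-index/my-type",
  Map("es.input.max.docs.per.partition" -> "100000"))
```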
