I am interested in optimizing read and write performance in a system that runs Spark 1.6.2 and ES 2.4. I have read the existing posts and docs, but there are a few minor points I would like to clarify.
I am looking into the connector configuration, in particular the following parameters (a sketch of how I set them follows the list):
"es.scroll.size"
"es.batch.size.bytes"
"es.batch.size.entries"
I have three questions.
The meaning of the "es.scroll.size" param
The documentation says: "the total number of documents returned is LIMIT * NUMBER_OF_SCROLLS (OR TASKS)".
Am I right that if I have 4 shards in my ES, then I will have 4 scrolls (tasks) with multiple requests per scroll, where each request will contain "es.scroll.size" documents?
Impact of the "es.batch.size.bytes" and "es.batch.size.entries" params.
I have tried the following value pairs: (5mb, 5,000), (10mb, 10,000), (15mb, 15,000), (25mb, 25,000). The first two give a nice performance benefit; however, performance is roughly the same across the last three pairs.
Is there some limit around 10,000 for the "es.batch.size.entries" attribute?
Are there any other attributes to improve the reading/writing performance?
Each task opens its own scroll request against the shard data it is assigned to read. The es.scroll.size property defines how many documents to return from each call to the scroll. If that number is larger than the number of documents in the scroll, there will be only one call; if it is half the size, there will be two calls to the same scroll, and so on (see the toy arithmetic below).
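Put as arithmetic (a toy illustration, with made-up numbers):

```scala
// Toy arithmetic: scroll calls a single task makes against its assigned shard.
val docsInShard = 2500                 // documents the scroll will return in total
val scrollSize  = 1000                 // es.scroll.size
val calls = math.ceil(docsInShard.toDouble / scrollSize).toInt  // = 3 calls
```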
When we say "the total number of documents returned" in the documentation, what we mean is the total number of documents that Elasticsearch will have to serve up to worker tasks at any given time.
I'm not sure what you mean by "Is there any limit on 10,000 for the 'es.batch.size.entries' attribute". We don't impose any limit on that value if you choose to configure it higher; the default is 1000. If the last set of batching properties is not increasing the processing speed any further, you may be butting up against limitations in indexing speed on the Elasticsearch side. I would take a look at instrumenting, measuring, and improving performance on the Elasticsearch side as well.
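To illustrate the write side (a sketch, not a recommendation: the values are placeholders to test against your own measurements, and es.batch.write.refresh / es.batch.write.retry.count are related connector settings worth knowing about):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

val conf = new SparkConf()
  .setAppName("es-write-tuning")
  .set("es.nodes", "localhost:9200")
  .set("es.batch.size.bytes", "10mb")      // flush a bulk request at this size...
  .set("es.batch.size.entries", "10000")   // ...or at this many docs, whichever comes first
  .set("es.batch.write.refresh", "false")  // skip the index refresh after each bulk call
  .set("es.batch.write.retry.count", "3")  // retries when Elasticsearch rejects a bulk

val sc = new SparkContext(conf)
sc.parallelize(Seq(Map("field" -> "value"))).saveToEs("myindex/mytype")
```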
In ES-Hadoop 5.0 we have added support for sliced scrolling in Elasticsearch 5.0. Sliced scrolling is a new feature that allows a scroll to be divided into roughly equal parts by only returning records that hash into the scroll slice you are reading from. This lets us subdivide scroll read requests further and allows more reader tasks to consume from Elasticsearch at a given time. The new property in ES-Hadoop 5.0 that controls this is es.input.max.docs.per.partition (more info here).
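A minimal sketch of wiring that up (the index name and the per-partition cap are placeholder values; this assumes ES-Hadoop 5.0 against an Elasticsearch 5.0 cluster):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

// Cap the documents per input partition so large shards are split
// into multiple scroll slices and read by more tasks in parallel.
val conf = new SparkConf()
  .setAppName("es-sliced-scroll")
  .set("es.nodes", "localhost:9200")
  .set("es.input.max.docs.per.partition", "100000") // placeholder cap

val sc = new SparkContext(conf)
val docs = sc.esRDD("myindex/mytype")
println(s"input partitions: ${docs.getNumPartitions}")
```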