Bulk write to ES | best practices

(Piyush Goyal) #1

Hi Costin,

Just wanted to check with you on the best configuration for bulk update. I went through previously asked questions at:


Elasticsearch version: 1.6~
I have a 3-node cluster with 32 shards, each shard containing about 1 million records. The requirement is to fetch, update, and re-index the documents.

Questions around fetching:
1.) I tried fetching documents with
es.scroll.size = 1000 and 3000
Surprisingly, results were better with 1000. Why is that? Also, I assume that each partition creates its own search request, and therefore its own scroll ID, so if my 5 partitions hit ES concurrently I am asking for 5*1000 records at once. What is the optimum number for fetching such a huge dataset, one that reduces the execution time of the Spark job while not hurting ES?
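For reference, here is roughly how I'm passing the read-side setting, plus the back-of-the-envelope concurrency math from above (the index/type name is made up for illustration):

```python
# Read-side settings as passed to the connector (resource name is hypothetical).
read_conf = {
    "es.resource": "myindex/mytype",  # hypothetical index/type
    "es.scroll.size": "1000",         # docs per scroll request, per partition
}

def concurrent_docs(partitions, scroll_size):
    """Each partition holds its own scroll, so in-flight docs multiply."""
    return partitions * scroll_size

print(concurrent_docs(5, 1000))  # 5 partitions * 1000 docs = 5000 in flight
```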

Questions around Indexing:
1.) In one of your posts you mentioned that the bulk size should be whatever completes in 1-2 seconds. With about 1 million documents to be updated by each task (32 tasks in my case), the job takes about the same time even if I increase es.batch.size.bytes and es.batch.size.entries; playing with them hardly affects the total time. What's your suggestion here?
2.) es.batch.write.refresh - Does this property disable the refresh before the bulk indexing starts and re-enable it after the indexing is over?
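For completeness, these are the write-side settings in question, as I'm currently passing them (the values are just what I've been experimenting with, not recommendations):

```python
# Write-side settings under discussion (values are experimental, not recommended).
write_conf = {
    "es.batch.size.bytes": "1mb",      # flush when the buffered bulk reaches this size
    "es.batch.size.entries": "1000",   # ...or this many docs, whichever comes first
    "es.batch.write.refresh": "true",  # the setting asked about in question 2
}
```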

Thanks in advance

(Costin Leau) #2

Assuming you already read the docs:

  1. Asking for more data doesn't necessarily mean better performance. You are saying that 5K docs at once work better than 15K. That might very well be the case, since it's not just Elasticsearch that you need to consider but also the fact that these docs are buffered in memory by the connector and read by Spark.

  2. First off, the es.batch.* options affect writing, not reading. Also make sure both of them are high enough, since the first limit that is reached applies. In other words, if you keep the number of entries low and only increase the number of bytes, you won't see a difference, because es.batch.size.entries kicks in first.

  3. This setting applies the refresh after each task finishes its indexing. In other words it occurs at the end. Note that typically one doesn't need it and, in fact, it can be counter-productive when dealing with a large number of tasks. There's an issue open to address this in the next version.
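To illustrate point 2, here is a small toy model (not the connector's actual code) of a bulk buffer with two limits that flushes on whichever limit is hit first:

```python
def bulk_flushes(doc_sizes, max_entries, max_bytes):
    """Simulate a write buffer that flushes when either limit is reached.

    Returns the number of bulk requests issued. A sketch only, not the
    connector's implementation.
    """
    flushes = 0
    entries = nbytes = 0
    for size in doc_sizes:
        entries += 1
        nbytes += size
        if entries >= max_entries or nbytes >= max_bytes:
            flushes += 1
            entries = nbytes = 0
    if entries:
        flushes += 1  # flush whatever is left at the end of the task
    return flushes

docs = [500] * 100  # 100 docs of ~500 bytes each

# Raising only the bytes limit changes nothing while the entries limit trips first:
print(bulk_flushes(docs, max_entries=10, max_bytes=1_000_000))   # 10
print(bulk_flushes(docs, max_entries=10, max_bytes=10_000_000))  # still 10
# Raising both limits actually reduces the number of bulk requests:
print(bulk_flushes(docs, max_entries=50, max_bytes=10_000_000))  # 2
```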

(Piyush Goyal) #3

Thanks Costin.

Regarding points 1 and 2, I guess the best option is to play with the numbers and see which configuration works best. Regarding point 3:

ES documentation says that:

"If you don’t need near real-time accuracy on your search results, consider
dropping the index.refresh_interval of each index to 30s. If you are doing
a large import, you can disable refreshes by setting this value to -1 for the
duration of the import. Don’t forget to reenable it when you are finished!"

Refresh after each task, as you mentioned, might be counter-productive. I was wondering whether this particular setting actually disables refresh until the bulk upload has been finished by all the tasks.

(Costin Leau) #4

The connector does not disable refresh and then re-enable it. It's a setting that has quite an impact, and with each version of Elasticsearch it is needed less and less. If one wants to use it, one can easily do so by wrapping the job with two simple REST commands that disable and re-enable the refresh.
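A sketch of such a wrapper in Python, using the index-settings API (the host and index names are made up; actually calling it requires a reachable cluster):

```python
import json
from urllib import request

def refresh_settings_body(interval):
    """JSON body for the index-settings update; '-1' disables refresh."""
    return json.dumps({"index": {"refresh_interval": interval}})

def set_refresh_interval(es_url, index, interval):
    """PUT /<index>/_settings; call with '-1' before the job and e.g. '1s' after."""
    req = request.Request(
        f"{es_url}/{index}/_settings",
        data=refresh_settings_body(interval).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    return request.urlopen(req)  # needs a live cluster

# set_refresh_interval("http://localhost:9200", "myindex", "-1")  # before the job
# ... run the Spark job ...
# set_refresh_interval("http://localhost:9200", "myindex", "1s")  # after the job
```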

A principle the connector applies is to interact with and modify the index settings as little as possible; it is a pipe to Elasticsearch, not an abstraction/mapper for it.
