I just wanted to check with you on the best configuration for bulk updates. I went through the previously asked questions at:
Elasticsearch version: ~1.6
I have a 3-node cluster with 32 shards, each shard containing about 1 million records. The requirement is to fetch, update, and re-index the documents.
Questions around fetching:
1.) I tried fetching documents with
es.scroll.size = 1000 and 3000
Surprisingly, results were better with 1000. Why is that? Also, I assume that each partition creates its own search request, and therefore its own scroll id, so if my 5 partitions are hitting ES concurrently, I am asking for 5*1000 records at once (see the sketch below). What is the optimum number for fetching such a huge dataset, one that reduces the execution time of the Spark job while at the same time not overloading ES?
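For context, this is roughly how the read side is set up; the host and the index/type name are placeholders, not my actual values:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

// Placeholder host and index/type; the scroll size is the value I am
// experimenting with (1000 vs 3000).
val conf = new SparkConf()
  .setAppName("bulk-update-read")
  .set("es.nodes", "es-host:9200")
  .set("es.scroll.size", "1000")

val sc = new SparkContext(conf)

// Each partition opens its own scroll, so with 5 concurrent partitions
// this asks ES for 5 * 1000 documents at a time.
val docs = sc.esRDD("docs/doc")
```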
Questions around indexing:
1.) In one of your posts you mentioned that the bulk size should be whatever indexes in 1-2 seconds. With about 1 million documents to be updated by each task (32 tasks in my case), the job takes about the same time even if I increase the values of es.batch.size.bytes and es.batch.size.entries; playing with them hardly affects the total time. What's your suggestion here?
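For reference, here is a minimal sketch of the write side as I currently have it; the host, index name, and the exact batch values are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

// Placeholder values; es.batch.size.* apply per task instance, so with
// 32 tasks the effective bulk pressure on the cluster is 32x these numbers.
val conf = new SparkConf()
  .setAppName("bulk-update-write")
  .set("es.nodes", "es-host:9200")
  .set("es.batch.size.bytes", "5mb")
  .set("es.batch.size.entries", "3000")
  .set("es.write.operation", "update")   // the job updates existing documents

val sc = new SparkContext(conf)

// Toy stand-in for the transformed documents.
val updatedDocs = sc.makeRDD(Seq(Map("id" -> "1", "field" -> "new-value")))
updatedDocs.saveToEs("docs/doc", Map("es.mapping.id" -> "id"))
```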
2.) es.batch.write.refresh - does this property disable the refresh interval before the bulk indexing starts and then trigger a refresh after the indexing is over?
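To make the question concrete, here is the setting as I understand it; the comment spells out my (possibly wrong) assumption:

```scala
import org.apache.spark.SparkConf

// My assumption, which I would like confirmed: with this enabled (the
// documented default), es-hadoop issues an index refresh once the bulk
// write completes. What I am unsure about is whether it also suppresses
// refreshes during the write, i.e. the equivalent of setting
// index.refresh_interval to -1 beforehand and restoring it afterwards.
val conf = new SparkConf()
  .set("es.batch.write.refresh", "true")
```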
Thanks in advance