Best values for keep alive (search scroll api) and batch size (bulk api)

I need to reindex the data in my ES index (150 million documents). I am going to use the search and scroll API to read the data from the old index (I cannot use the reindex API because we are on an old ES version) and the bulk API to copy it into the new index.
My question is: what is the best value for the "keep alive" time of the scroll search? (1 min, 2 min, or 5 min?)
And what is the best batch size for a bulk request? (1000 elements or more?) How many documents should I copy in one bulk request?

I have 4 shards in the index


Hi Natalia,

The scroll parameter (what you called keep alive) is just the amount of time Elasticsearch keeps your scroll context alive when no further call to _search/scroll is made. Each scroll call that passes the parameter resets the timer, so the value only has to cover the time you need to process one batch, not the whole export. You don't have to optimize it: if you can make the next call within 1 minute, use 1m.
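To illustrate why the keep-alive only needs to cover one batch, here is a minimal sketch of a scroll loop. It assumes `client` is an object exposing `search(...)` and `scroll(...)` methods in the style of the official Python client (elasticsearch-py); the function name `scroll_all`, the index name, and the default page size are my own placeholders, and keyword arguments may differ between client versions, so check against the version you have installed.

```python
def scroll_all(client, index, keep_alive="1m", page_size=1000):
    """Yield every hit from `index`, fetching one scroll page at a time.

    Assumption: `client` behaves like an elasticsearch-py client, i.e.
    search() and scroll() return a dict with "_scroll_id" and "hits".
    """
    # First request opens the scroll context and returns the first page.
    page = client.search(
        index=index,
        scroll=keep_alive,  # context lives this long unless renewed
        size=page_size,
        body={"query": {"match_all": {}}},
    )
    while page["hits"]["hits"]:
        for hit in page["hits"]["hits"]:
            yield hit
        # Each scroll call passes the keep-alive again, which renews the
        # context. Always use the most recently returned scroll id.
        page = client.scroll(scroll_id=page["_scroll_id"], scroll=keep_alive)
```

As long as the work you do per page (here: sending one bulk request) finishes well within `keep_alive`, the total number of documents does not matter for this setting.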

To optimize the batch size of bulk request there is an interesting article in the docs. It says:

Fortunately, it is easy to find this sweet spot: Try indexing typical documents in batches of increasing size. When performance starts to drop off, your batch size is too big. A good place to start is with batches of 1,000 to 5,000 documents or, if your documents are very large, with even smaller batches.

It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1KB documents is very different from one thousand 1MB documents. A good bulk size to start playing with is around 5-15MB in size.

So it depends on the size of your documents. Find the number of documents that corresponds to about 5-15 MB and measure the performance. Once you are comfortable with the size you have chosen, start indexing.
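One way to follow that advice is to cap each batch by both document count and serialized size. The sketch below is only an illustration: `batch_by_size` is a hypothetical helper, the 10 MB and 5,000 limits are starting points taken from the ranges quoted above, and the size is estimated from the JSON of the document alone (the bulk action lines add a little overhead on top).

```python
import json


def batch_by_size(docs, max_bytes=10 * 1024 * 1024, max_docs=5000):
    """Group documents into batches limited by count and approximate size."""
    batch, batch_bytes = [], 0
    for doc in docs:
        # Approximate the payload size by the UTF-8 length of the JSON.
        doc_bytes = len(json.dumps(doc).encode("utf-8"))
        too_big = batch_bytes + doc_bytes > max_bytes
        too_many = len(batch) >= max_docs
        # Flush before adding if this document would exceed either limit.
        if batch and (too_big or too_many):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += doc_bytes
    if batch:
        yield batch  # final, possibly smaller, batch
```

Each yielded batch would then go into one bulk request. To find your sweet spot, run the copy on a sample with a few different `max_bytes` values, time it, and keep the setting where throughput stops improving.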

Hope it helps!
