Request volume management

Hello, I am running an ES 7.6.0 cluster and once in a while I need to perform a full index rebuild.

I currently need to index 15 million documents. My indexer can be tuned on request size, maximum concurrent calls, delay between requests, etc. All indexing goes through the bulk API.

I see no errors in any bulk response, yet at the end some documents are missing.

The only way I found to index everything safely was to dramatically reduce both payload size and number of concurrent requests.

I don't think it is feasible to use a binary-search-style approach to find the best settings for such a task, because no matter what I do the optimal configuration will change as the number of documents grows.

What is the best way to manage the volume of requests the indexer sends? Is there any way my indexer can be aware of the cluster's condition while operating?
Thanks in advance!

Are you using an official client or are you doing this "by hand" at the HTTP level?

When you say you see no errors do you mean that the top-level bulk response is 200 OK or are you checking the status of each document within the response?

Thanks for the prompt reply.
We are using a custom client which calls the _bulk endpoint.

When you say you see no errors do you mean that the top-level bulk response is 200 OK or are you checking the status of each document within the response?

Both. For every document I also make sure that failed is never greater than 0.

The Elasticsearch documentation is not very clear about how the cluster behaves when the request volume is too high, so I am not sure what to expect on the client side when that is happening or has happened.
Thanks

It depends a bit on exactly how the cluster is overloaded, but usually docs that couldn't be indexed due to overload would result in a 429 on those specific docs, even if the top-level response is 200.

The only way to get a success status for a doc (200, or 201 for a newly created document) is if it is written successfully to all in-sync copies (primary plus replicas), so those won't be lost. I think there may be something wrong in how you're detecting document-level failures.
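To illustrate document-level checking, here is a minimal sketch in Python. It assumes the bulk response body has already been parsed from JSON into a dict with an "items" list, where each item is a single-key object (for example "index" or "create") holding that document's "status". The function name failed_items and the sample response are made up for illustration and are not part of any client.

```python
def failed_items(bulk_response):
    """Return (position, action, status, error) for every item that was not indexed.

    `bulk_response` is the parsed JSON body of a _bulk call. The position
    matches the order of the actions in the request, so it can be used to
    look up the original document.
    """
    failures = []
    for position, item in enumerate(bulk_response.get("items", [])):
        # Each item is a single-key dict such as {"index": {...}}.
        action, result = next(iter(item.items()))
        status = result.get("status")
        # Treat anything outside 2xx (or a missing status) as a failure.
        if status is None or not 200 <= status < 300:
            failures.append((position, action, status, result.get("error")))
    return failures


# Hypothetical response: top-level HTTP status was 200, but one doc was rejected.
sample_response = {
    "errors": True,
    "items": [
        {"index": {"_id": "1", "status": 201}},
        {"index": {"_id": "2", "status": 429,
                   "error": {"type": "es_rejected_execution_exception"}}},
    ],
}

for position, action, status, error in failed_items(sample_response):
    print(position, action, status, error)
```

Checking each item's status directly avoids relying only on the top-level "errors" flag or on shard counters, which is where a custom client can easily miss a rejected document.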

Thanks. I will double check whether the indexer is missing some 429s. If so, I will retry the affected documents and also throttle subsequent requests.

Note that it's not just 429 that you should handle specially. Any per-document status other than a 2xx (200, or 201 for a newly created document) indicates a problem during indexing. I think other 4xx codes should not be retried since they indicate something is fundamentally wrong with the request, but maybe 5xx codes can be retried since they may be transient server-side issues.
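The retry policy described above could be sketched roughly as follows. This is only an illustration under stated assumptions: send_bulk is a hypothetical callable standing in for the custom client, taking a list of documents and returning one status code per document in the same order. The exponential backoff, the attempt limit, and the names classify and index_with_retry are likewise assumptions, not anything Elasticsearch prescribes.

```python
import time


def classify(status):
    """Map a per-document bulk status to 'ok', 'retry' or 'fail'."""
    if 200 <= status < 300:
        return "ok"
    if status == 429 or status >= 500:
        return "retry"   # overload or possibly transient server-side issue
    return "fail"        # other 4xx: the request itself is wrong


def index_with_retry(docs, send_bulk, max_attempts=5, base_delay=1.0,
                     sleep=time.sleep):
    """Send docs, resending only retryable ones with exponential backoff.

    `send_bulk(docs)` is a hypothetical function that performs one _bulk
    call and returns a list of per-document status codes, in order.
    Returns (rejected, still_pending): docs that failed permanently with
    their status, and docs that were still retryable when attempts ran out.
    """
    pending = list(docs)
    rejected = []
    for attempt in range(max_attempts):
        if not pending:
            break
        statuses = send_bulk(pending)
        retry = []
        for doc, status in zip(pending, statuses):
            verdict = classify(status)
            if verdict == "retry":
                retry.append(doc)
            elif verdict == "fail":
                rejected.append((doc, status))
        pending = retry
        # Back off before the next attempt so the cluster can recover.
        if pending and attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return rejected, pending
```

Backing off whenever a 429 appears also gives the indexer a crude signal of the cluster's condition, so the request volume adapts at run time instead of being tuned by hand up front.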


Makes perfect sense. Thank you again!