Find out when an index has "settled" after a series of bulk inserts?

I am doing multiple bulk inserts using the _bulk endpoint, using a Rust module (not yet concurrent, but it will be).

With my test files this should produce 9264 "Lucene documents" (LDocs), i.e. documents in the index. I check this with the _count endpoint, from the Python module which called the Rust "bulk indexer" module.

For the avoidance of doubt, I'm NOT talking here about requests which time out: all these _bulk requests return fine. In case it is significant, I am doing the _bulk inserts with reqwest::blocking::Client in Rust, not the async reqwest::Client. If anything, the blocking version should take longer to return.

What I find, once back in Python, when in theory all my 9264 LDocs have been bulk-inserted successfully, is that unless I leave a significant gap, say 3 seconds, before checking _count, I get an arbitrary lower figure than 9264, such as 8340 or 7633.

My understanding is that the _tasks endpoint may well be the way to go: i.e. do a loop to check every 0.1s or so to see whether the task(s) in question (multiple bulk posts) have all ended. But I'm not sure how to use _tasks to get the information I want: I have tried these 3 calls:

success, deliverable = process_json_request(f'{ES_URL}/_tasks?actions=indices:data/write/bulk*&detailed&format=json')
success, deliverable = process_json_request(f'{ES_URL}/_tasks?detailed&format=json')
success, deliverable = process_json_request(f'{ES_URL}/_tasks?detailed=true&actions=*reindex&format=json')

This utility method calls requests.request(...) in Python.

None of the above 3 URLs seems to deliver what I'm looking for.

Does anyone know how to check that "bulk insertion activity has ceased" and that the index has returned to an "idle" state ... ?

Bulk insertion activity has already ceased by the time the response to each POST _bulk call is sent (assuming no failures, network outages and so on). However, GET _count reports only the documents exposed to searches, i.e. those that have been refreshed, so you also need a refresh to have happened after the bulk indexing operations. You can wait for the next periodic refresh, include the ?refresh parameter on the last POST _bulk call, or call the POST _refresh API explicitly.
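As a rough sketch of the explicit-refresh option, something like the following could run in the Python module once the Rust indexer has returned. ES_URL, INDEX, http_json and refreshed_count are all made-up names for illustration, and it uses the standard library's urllib rather than your own process_json_request helper, so adapt it to whatever you already have.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local, unsecured cluster
INDEX = "my-index"                # hypothetical index name


def http_json(method, url):
    """Send a body-less HTTP request and decode the JSON response."""
    req = urllib.request.Request(url, method=method)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def refreshed_count(es_url, index, send=http_json):
    """Force a refresh, then return the number of searchable documents.

    `send` is injectable so the function can be exercised without a cluster.
    """
    # Make everything indexed so far visible to search (and so to _count).
    send("POST", f"{es_url}/{index}/_refresh")
    # _count responds with a JSON object containing a "count" field.
    return send("GET", f"{es_url}/{index}/_count")["count"]


# Usage, once all _bulk responses have come back:
#     n = refreshed_count(ES_URL, INDEX)
```

If you would rather not make the extra call, adding ?refresh=true (or ?refresh=wait_for) to the final POST _bulk request should give the same result, with the refresh folded into that request.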

