Index does not refresh on time

Hello everyone.

I am using the Python API for Elasticsearch to create indices and insert data from multiple CSV files into them. I am also creating the index patterns with Python and the Kibana API.

While inserting each row from a CSV file into Elasticsearch, I assign it a doc_id, an integer based on the doc_id of the last document present in that particular index. If the specified index does not exist, my code creates it and inserts documents with doc_ids starting from 1, incrementing the id by 1 with each insert. If the index already exists, it looks up the doc_id of the last document in that index, starts from "doc_id+1", and keeps incrementing by 1 for each insert thereafter.

This works fine when there are not too many CSV files and the files themselves do not have too many rows. However, for a large number of files, it sometimes results in indices with fewer documents than there are rows in the CSV files. The problem seems to be in the step that gets the last doc_id. My guess is that the index has not refreshed in time, so instead of the last id the lookup returns one from the middle, and all the documents after it are overwritten, leaving fewer documents in Elasticsearch than rows in the CSV files.
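To make the suspected failure mode concrete, here is a toy model in plain Python. It does not talk to Elasticsearch at all: `ToyIndex`, `load_file`, and `run` are hypothetical names invented for this illustration, and the only assumption built in is that a lookup of the last doc_id sees just the documents that were present at the last refresh.

```python
class ToyIndex:
    """Toy stand-in for an index. Writes are stored at once, but lookups
    only see what was present at the last refresh. Not Elasticsearch code."""

    def __init__(self):
        self.stored = {}      # every document written so far, keyed by doc_id
        self.searchable = {}  # snapshot that lookups can see

    def index(self, doc_id, doc):
        self.stored[doc_id] = doc  # writing an existing doc_id overwrites it

    def refresh(self):
        self.searchable = dict(self.stored)

    def last_doc_id(self):
        # Mimics "find the highest doc_id": sees only the snapshot.
        return max(self.searchable, default=0)


def load_file(index, rows):
    """Continue numbering from the last doc_id that a lookup can see."""
    next_id = index.last_doc_id() + 1
    for row in rows:
        index.index(next_id, row)
        next_id += 1


def run(refresh_between_files):
    """Load three 100-row 'files' and return how many documents survive."""
    index = ToyIndex()
    files = [[f"file{f}-row{r}" for r in range(100)] for f in range(3)]
    for rows in files:
        load_file(index, rows)
        if refresh_between_files:
            index.refresh()
    return len(index.stored)


print(run(refresh_between_files=True))   # 300
print(run(refresh_between_files=False))  # 100
```

When the snapshot is stale, each file restarts from an id that is already taken, so later rows replace earlier ones and the final count falls short of the 300 rows that were sent.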

Does anyone have any idea how I might solve this problem? Is there a setting that I need to change or tweak? Do I need to add delays before inserting the documents? If so, please guide me. Your help will be appreciated.

Thanks & regards,
Abhishek Das

Hi,
It is likely that your hypothesis is right: Elasticsearch is not designed for transactions. Newly indexed documents become visible to search only after a refresh, and the refresh timing is a "best effort" objective, not a guarantee.
In the first place, do you really need the id to be an incremental counter? If not, and you just need to put the content of many files into an index, send the documents without an id and let Elasticsearch generate the _id automatically. No document will be overwritten this way.
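As a rough sketch of that suggestion, the generator below turns CSV rows into bulk actions that carry no "_id" key. The function name `csv_actions`, the index name "people", and the sample data are made up for the example. The action shape (`_index` plus `_source`) is the one accepted by the bulk helper in the official Python client.

```python
import csv
import io


def csv_actions(csv_file, index_name):
    """Yield one bulk action per CSV row, deliberately without an "_id" key."""
    for row in csv.DictReader(csv_file):
        yield {"_index": index_name, "_source": row}


# In-memory sample standing in for a real CSV file.
sample = io.StringIO("name,city\nAda,London\nLinus,Helsinki\n")
actions = list(csv_actions(sample, "people"))
print(actions[0])

# With a live cluster, the generator could be passed to the bulk helper.
# This assumes the official `elasticsearch` package and a node reachable
# at the (hypothetical) address below:
#   from elasticsearch import Elasticsearch, helpers
#   helpers.bulk(Elasticsearch("http://localhost:9200"), csv_actions(f, "people"))
```

Because no action names an id, there is nothing to look up before inserting, so a stale view of the index cannot cause overwrites.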
If you want to manage concurrency, you may want to have a look at optimistic concurrency control.
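For a numeric sequence, one way optimistic concurrency control could be applied is to keep the last used id in a single counter document and reserve a block of ids with a conditional write. The sketch below shows only the retry pattern against an in-memory stand-in: `FakeCounterStore`, `Conflict`, and `reserve_ids` are hypothetical names, not part of any Elasticsearch client. On a real cluster, the conditional write would be an index request carrying both `if_seq_no` and `if_primary_term`, and a lost race would come back as an HTTP 409 conflict. The stand-in tracks only a sequence number to stay short.

```python
class Conflict(Exception):
    """Stand-in for the version-conflict error a real cluster would return."""


class FakeCounterStore:
    """In-memory stand-in for one counter document (hypothetical, demo only).

    A read returns the document's sequence number, and a write is rejected
    if that number has changed since the read.
    """

    def __init__(self):
        self.value = 0
        self.seq_no = 0

    def get(self):
        return {"_seq_no": self.seq_no, "_source": {"last_id": self.value}}

    def index(self, last_id, if_seq_no):
        if if_seq_no != self.seq_no:
            raise Conflict("document changed since it was read")
        self.value = last_id
        self.seq_no += 1


def reserve_ids(store, count, max_retries=10):
    """Reserve `count` consecutive ids and return them as a range."""
    for _ in range(max_retries):
        current = store.get()
        first = current["_source"]["last_id"] + 1
        last = first + count - 1
        try:
            # Succeeds only if nobody moved the counter since our read.
            store.index(last, if_seq_no=current["_seq_no"])
            return range(first, last + 1)
        except Conflict:
            continue  # another writer won the race; read again and retry
    raise RuntimeError("could not reserve ids after several retries")


store = FakeCounterStore()
print(list(reserve_ids(store, 3)))  # [1, 2, 3]
print(list(reserve_ids(store, 2)))  # [4, 5]
```

The counter is read by document id, not by search. In Elasticsearch, a get by id is realtime by default, so this kind of lookup does not depend on the index having refreshed.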

Thank you @vincenbr! I'll look into it. At the moment, it is imperative that the ids are a numeric sequence, hence this approach.
