Index does not refresh on time

sunny3794.ad · February 25, 2022, 1:05pm

Hello everyone.

I am using the Python API for Elasticsearch to create indices and insert the data from multiple CSV files into those indices. I am also creating the index patterns using Python and the Kibana API.

While inserting each row from the csv file into Elasticsearch, I am assigning it a doc_id which is an integer based on the value of doc_id of the last document present in that particular index. So, how it works is that if the specified index does not exist, it creates the index and starts inserting documents with doc_ids starting from 1 and keeps incrementing the id by 1 with each insert. On the other hand, if the index exists, it checks for the doc_id of the last document present in the specified index and starts from "doc_id+1" and keeps incrementing by 1 for each insert thereafter.

This works fine if there are not too many CSV files or if the files themselves don't have too many rows of data. However, for a large number of files, this sometimes results in indices with fewer documents than are actually present in the CSV files. The problem seems to be in the part that gets the last doc_id. My guess is that the index is unable to refresh in time and hence, instead of getting the last id, it ends up getting one in the middle, and all the documents thereafter are overwritten, resulting in fewer documents in Elasticsearch than present in the CSV files.

Does anyone have any idea how I may solve this problem? Is there any setting that I need to change or tweak? Do I need to add delays before the insertion of the documents? If so, please guide me. Your help will be appreciated.

Thanks & regards,
Abhishek Das

vincenbr · February 25, 2022, 3:46pm

Hi,
It is likely that your hypothesis is right: Elasticsearch is not meant for transactions, and refresh is a "best effort" time objective.
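To illustrate the refresh behaviour: a search only sees documents that were indexed before the last refresh (by default roughly once per second), so looking up the last id through a search can return a stale value. A minimal sketch, assuming a recent Python client (keyword arguments such as `aggs=`), and assuming each document also stores its id in a numeric `doc_id` field; the function name `next_doc_id` and the field layout are illustrative, not taken from the original post:

```python
def next_doc_id(es, index):
    """Return the next integer doc_id to use for `index`.

    `es` is an Elasticsearch client instance. Assumes (hypothetically)
    that every document stores its id in a numeric `doc_id` field.
    """
    # A brand-new index starts counting at 1.
    if not es.indices.exists(index=index):
        return 1

    # Force a refresh so the search below sees every document
    # that has already been indexed.
    es.indices.refresh(index=index)

    # Ask for the largest doc_id without fetching any hits.
    resp = es.search(
        index=index,
        size=0,
        aggs={"max_doc_id": {"max": {"field": "doc_id"}}},
    )
    max_value = resp["aggregations"]["max_doc_id"]["value"]

    # The max aggregation returns None when the index is empty.
    return 1 if max_value is None else int(max_value) + 1
```

Calling this once per CSV file and then counting upwards locally avoids a lookup per row. It still does not protect against two writers running at the same time.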
In the first place, do you really need the id to be an incremental counter? If not, and you just need to put the content of many files into an index, send the documents without an id and let Elasticsearch generate the _id automatically. No document will be overwritten this way.
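As a rough sketch of that suggestion, the generator below turns CSV rows into bulk actions that carry no `_id`, so Elasticsearch assigns one to each document. The function name `csv_actions` and the index name in the usage comment are made up for illustration:

```python
import csv


def csv_actions(csv_file, index):
    """Yield one bulk action per CSV row, without an explicit _id.

    `csv_file` is any open text file (or file-like object) whose first
    line is the header row.
    """
    for row in csv.DictReader(csv_file):
        # No "_id" key: Elasticsearch generates a unique id itself,
        # so no document can overwrite another.
        yield {"_index": index, "_source": row}


# Usage with a real cluster (not executed here), assuming `es` is an
# Elasticsearch client and "my-index" is a placeholder index name:
#
#   from elasticsearch import helpers
#   with open("data.csv", newline="") as f:
#       helpers.bulk(es, csv_actions(f, "my-index"))
```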
If you want to manage concurrency, you may have a look at optimistic concurrency control.
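One possible way to combine optimistic concurrency control with a numeric sequence is to keep the last used id in a small counter document and reserve a block of ids per file. This is only a sketch under assumptions: the counter document, its `last_id` field, and the function `reserve_ids` are hypothetical, and the counter document is assumed to exist already. It relies on the get API being real-time by default (so it does not wait for a refresh) and on the `if_seq_no` / `if_primary_term` parameters of the index API:

```python
def reserve_ids(es, counter_index, counter_id, count, max_retries=5):
    """Reserve `count` consecutive integer ids and return them as a range.

    Assumes a counter document {"last_id": <int>} already exists under
    `counter_id` in `counter_index` (both names are placeholders).
    """
    for _ in range(max_retries):
        # Get-by-id is real-time, so this read does not depend on refresh.
        current = es.get(index=counter_index, id=counter_id)
        last_id = current["_source"]["last_id"]
        try:
            # The write only succeeds if nobody changed the counter
            # since we read it; otherwise Elasticsearch answers 409.
            es.index(
                index=counter_index,
                id=counter_id,
                document={"last_id": last_id + count},
                if_seq_no=current["_seq_no"],
                if_primary_term=current["_primary_term"],
            )
        except Exception as exc:
            # The client's API errors expose the HTTP status as
            # `status_code`; anything other than a conflict is re-raised.
            if getattr(exc, "status_code", None) != 409:
                raise
            continue  # another writer won the race, read again and retry
        return range(last_id + 1, last_id + count + 1)
    raise RuntimeError("could not reserve ids after %d attempts" % max_retries)
```

Each loader would then call `reserve_ids` once with the number of rows in its file and assign the returned ids locally, instead of searching for the last doc_id.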

sunny3794.ad · February 25, 2022, 4:10pm

Thank you @vincenbr! I'll take a look into it. At the moment, it is imperative that the ids are a numeric sequence, hence this approach.

system · March 25, 2022, 4:10pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
