I am using the Python API for Elasticsearch to create indices and insert data from multiple CSV files into those indices. I am also creating the index patterns with Python via the Kibana API.
While inserting each row from a CSV file into Elasticsearch, I assign it an integer doc_id derived from the doc_id of the last document already present in that index. It works like this: if the specified index does not exist, the script creates it and starts inserting documents with doc_ids starting from 1, incrementing by 1 with each insert. If the index does exist, it looks up the doc_id of the last document in that index, starts from doc_id + 1, and keeps incrementing by 1 for each subsequent insert.
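To make the scheme concrete, here is a minimal sketch of the id-assignment logic described above. The names are illustrative, not my exact code, and a plain dict stands in for the Elasticsearch index so the snippet runs on its own; in the real script the inserts go through the Elasticsearch client instead.

```python
# Sketch of the doc_id assignment scheme (illustrative names).
# A plain dict stands in for an Elasticsearch index so this runs
# without a cluster; in the real script the last id is read back
# from Elasticsearch and each insert is an explicit-id index call.

def last_doc_id(index):
    """Return the highest doc_id in the index, or 0 if it is empty."""
    return max((int(i) for i in index), default=0)

def insert_rows(index, rows):
    """Insert each row with an explicit, monotonically increasing id."""
    next_id = last_doc_id(index) + 1    # continue from the last id
    for row in rows:
        index[str(next_id)] = row       # explicit-id insert
        next_id += 1

index = {}                                        # "index" does not exist yet
insert_rows(index, [{"a": 1}, {"a": 2}])          # ids start at 1
insert_rows(index, [{"a": 3}])                    # existing index: continues at 3
```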
This works fine if there are not too many CSV files, or if the files themselves don't have too many rows. With a large number of files, however, some indices end up with fewer documents than the CSV files contain. The problem seems to be in the code that fetches the last doc_id: my guess is that the index does not refresh in time, so instead of returning the last id it returns one from somewhere in the middle, and every document inserted after that overwrites an existing one, leaving fewer documents in Elasticsearch than there are rows in the CSV files.
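To illustrate the failure mode I suspect (the numbers are hypothetical, and a dict again stands in for the index so this runs without a cluster): if the lookup of the last doc_id returns a stale value because the index has not refreshed yet, the next batch of rows silently overwrites existing documents instead of appending to them.

```python
# Simulation of the suspected stale-read overwrite; no Elasticsearch
# needed, a dict stands in for the index. The stale value of 5 is a
# made-up example of what a not-yet-refreshed search might report.

index = {str(i): {"row": i} for i in range(1, 11)}  # 10 docs, ids 1..10

stale_last_id = 5             # refresh lag: lookup reports 5, not 10
next_id = stale_last_id + 1
for row in range(11, 16):     # insert 5 new rows from the next CSV
    index[str(next_id)] = {"row": row}  # ids 6..10 overwrite old docs
    next_id += 1

print(len(index))             # 10 documents remain, not the expected 15
```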
Does anyone have an idea of how I might solve this problem? Is there a setting I need to change or tweak? Do I need to add delays before inserting the documents? If so, please guide me. Your help will be appreciated.
Thanks & regards,