Hi,
Here's the context:
- I have a database with 1.5M rows, each with a unique id
- The Elasticsearch cluster has 3 nodes running Elasticsearch 7.7
- The index has 9 shards with 1 replica
- I have a Python script that gets the data from the database and indexes it into Elasticsearch
- The script is multi-processed and bulk-indexes batches of 500 documents (sometimes fewer)
In the script, the DB query returns 1.5M unique rows; after splitting them into batches of 500 documents I still have the right amount, and even when I sum all the successfully indexed documents I still get the right total. In other words, after each bulk index, the response confirms that all 500 documents of the batch were indexed.
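For context, the batching step looks roughly like this (a minimal sketch; `rows` and the `id` field are simplified placeholders, not the exact code):

def make_batches(rows, size=500):
    # Build one bulk action per row; each row carries its unique database id
    actions = [
        {"_index": "my_index", "_id": row["id"], "_source": row}
        for row in rows
    ]
    # Chunk into batches of `size` documents (the last batch may be smaller)
    return [actions[i:i + size] for i in range(0, len(actions), size)]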
The script looks like this:

from multiprocessing import Pool

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

def import_batch(batch):
    # some code here... (ES_HOST, ES_PORT and LOG are defined elsewhere)
    es = Elasticsearch(f"{ES_HOST}:{ES_PORT}", retry_on_timeout=True,
                       max_retries=5, timeout=600)
    res = bulk(es, batch, refresh='wait_for')
    # res[0] holds the number of successfully indexed documents;
    # res[0] == len(batch) every time
    return res[0]

def run_imports(batches):
    with Pool(4) as p:
        nb = p.map(import_batch, batches)
    LOG.info('Total Indexed', total=sum(nb))
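To rule out silent per-item failures, the success count in res[0] can also be cross-checked against the helper's error list (a sketch; with raise_on_error=False the bulk helper returns the errors instead of raising on the first failed action):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

def import_batch_checked(batch):
    es = Elasticsearch(f"{ES_HOST}:{ES_PORT}", retry_on_timeout=True,
                       max_retries=5, timeout=600)
    # With raise_on_error=False, bulk() returns (success_count, errors)
    # instead of raising BulkIndexError on the first failed action
    success, errors = bulk(es, batch, refresh='wait_for', raise_on_error=False)
    if errors:
        LOG.error('Bulk errors', errors=errors)
    return success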
At the end, the total number of documents indexed is 1.5M, which is expected. However, when I do a count on my index, I get ~1.45M. I ran the process several times; the number of lost documents is random but always around 40-50K.
The query, just in case: GET http://{{ host }}:{{ port }}/my_index/_count
"count": 1454260,
"_shards": {
"total": 9,
"successful": 9,
"skipped": 0,
"failed": 0
}
}
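(For completeness, the same count via the Python client, a sketch using the same client setup as above:)

es = Elasticsearch(f"{ES_HOST}:{ES_PORT}")
print(es.count(index='my_index')['count'])  # ~1.45M instead of the expected 1.5M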
I also tried without the multiprocessing library; fewer documents were lost, but still around 5-10K.
Any idea why the sum of successfully indexed documents and the count of documents currently in the index don't match?
Thanks.