Documents lost or not indexed during bulk indexing

Hi,
Here's the context:

  • I have a database with 1.5M rows of data with unique ids
  • The Elasticsearch cluster has 3 nodes running Elasticsearch 7.7
  • The index has 9 shards with 1 replica
  • I have a Python script that gets the data from the database and indexes it in Elasticsearch
  • The script is multi-processed and does bulk indexing in batches of 500 documents, sometimes fewer

In the script, the DB query returns 1.5M unique rows. After splitting them into batches of 500 documents I still have the right amount, and when I sum all the successfully indexed documents I still have the right amount. So after each bulk request, the response confirms that 500 documents have been indexed.

The script looks like this:

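from multiprocessing import Pool

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk


def import_batch(batch):
    # some code here...
    es = Elasticsearch(f"{ES_HOST}:{ES_PORT}", retry_on_timeout=True,
                       max_retries=5, timeout=600)
    # bulk() returns a (success_count, errors) tuple
    res = bulk(es, batch, refresh='wait_for')
    # res[0] is the number of successfully indexed documents;
    # res[0] == len(batch) every time
    return res[0]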


def run_imports(batches):
    with Pool(4) as p:
        nb = p.map(import_batch, batches)
        LOG.info('Total Indexed', total=sum(nb))

At the end, the total number of documents indexed is 1.5M, which is expected. However, when I run a count on my index, I get ~1.45M. I ran the process several times; the number of lost documents varies but is always around 40-50K.

The query, just in case: http://{{ host }}:{{ port }}/my_index/_count

{
  "count": 1454260,
  "_shards": {
    "total": 9,
    "successful": 9,
    "skipped": 0,
    "failed": 0
  }
}

I also tried without the multiprocessing library; I lost fewer documents, but still around 5-10K.

Any idea why the sum of successful documents indexed and the count of the current documents in the index are not the same?

Thanks.

If things are not being indexed, what do the logs show about why? The bulk response should give you a reason.

The logs don't show anything wrong; I always get 0 errors and the list of indexed documents. As I said in the post, the number of indexed documents and the number of documents sent in the bulk request always match.

Here's an extract

{"index":{"_id":10749879,"_index":"index_name"}}
{mydocument}
...


{
    "took": 629,
    "errors": false,
    "items": [{
        "index": {
            "_index": "index_name",
            "_type": "_doc",
            "_id": "10750755",
            "_version": 9,
            "result": "updated",
            "_shards": {
              "total": 1,
              "successful": 1,
              "failed": 0
              },
            "_seq_no": 6340195,
            "_primary_term": 16,
            "status": 200
        }
    },
    ...]
}
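
Each item in a bulk response carries a "result" field, and an index action on an id that already exists is reported as "updated" with status 200, so it still counts as a success. A minimal sketch of tallying those values, assuming the raw response has the shape shown above (the function name tally_bulk_results and the sample data are made up for illustration):

```python
from collections import Counter


def tally_bulk_results(response):
    """Count bulk items by their "result" value, e.g. "created" or "updated"."""
    counts = Counter()
    for item in response.get("items", []):
        # Each item is a single-key dict such as {"index": {...}}
        for details in item.values():
            counts[details.get("result", "unknown")] += 1
    return counts


# Hypothetical response fragment, trimmed to the fields used here
sample_response = {
    "errors": False,
    "items": [
        {"index": {"_id": "1", "result": "created", "status": 201}},
        {"index": {"_id": "2", "result": "created", "status": 201}},
        {"index": {"_id": "1", "result": "updated", "status": 200}},
    ],
}

print(tally_bulk_results(sample_response))
```

On a first load into an empty index, any "updated" entry would mean the same id was sent more than once. On a re-run over an existing index, "updated" is expected and says nothing by itself.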

I found the issue: it was in my SQL query.
This post can be deleted if needed.
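
The post does not say what was wrong with the query. One cause that would produce these symptoms, offered purely as an assumption, is a query that returns the same id on more than one row: every bulk action succeeds, but later documents overwrite earlier ones, so the index count ends up lower than the sum of successes. A rough sketch of checking for that before indexing, assuming each action is a dict with an "_id" key (the helper name find_duplicate_ids is hypothetical):

```python
from collections import Counter


def find_duplicate_ids(batches, id_key="_id"):
    """Return {id: occurrences} for every id that appears more than once."""
    counts = Counter(doc[id_key] for batch in batches for doc in batch)
    return {doc_id: n for doc_id, n in counts.items() if n > 1}


# Hypothetical batches: id 2 appears in both
batches = [
    [{"_id": 1, "name": "a"}, {"_id": 2, "name": "b"}],
    [{"_id": 2, "name": "b2"}, {"_id": 3, "name": "c"}],
]

duplicates = find_duplicate_ids(batches)
total_sent = sum(len(batch) for batch in batches)
# Number of documents the index can hold at most after this load
expected_in_index = total_sent - sum(n - 1 for n in duplicates.values())
print(duplicates, total_sent, expected_in_index)
```

If the dictionary comes back empty, duplicate ids are ruled out and the gap has to come from somewhere else.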
