Hi,
Here's the context:
- I have a database with 1.5M rows, each with a unique id
- The Elasticsearch cluster has 3 nodes running Elasticsearch 7.7
- The index has 9 shards with 1 replica
- I have a Python script that gets the data from the database and indexes it into Elasticsearch
- The script is multi-processed and bulk-indexes batches of 500 documents (sometimes fewer)
In the script, the DB query returns 1.5M unique rows; after splitting them into batches of 500 documents I still have the right amount, and even when I sum all the successfully indexed documents I still get the right total. In other words, after each bulk index, the response confirms that all 500 documents of the batch were indexed.
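For context, the batching step looks roughly like this (a minimal sketch; `rows` and the `id` field are simplified placeholders, not the exact code):

def make_batches(rows, size=500):
    # Build one bulk action per row; each row carries its unique database id
    actions = [
        {"_index": "my_index", "_id": row["id"], "_source": row}
        for row in rows
    ]
    # Chunk into batches of `size` documents (the last batch may be smaller)
    return [actions[i:i + size] for i in range(0, len(actions), size)]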
The script looks like this:

from multiprocessing import Pool

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

def import_batch(batch):
    # some code here... (ES_HOST, ES_PORT and LOG are defined elsewhere)
    es = Elasticsearch(f"{ES_HOST}:{ES_PORT}", retry_on_timeout=True,
                       max_retries=5, timeout=600)
    res = bulk(es, batch, refresh='wait_for')
    # res[0] holds the number of successfully indexed documents;
    # res[0] == len(batch) every time
    return res[0]

def run_imports(batches):
    with Pool(4) as p:
        nb = p.map(import_batch, batches)
    LOG.info('Total Indexed', total=sum(nb))
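To rule out silent per-item failures, the success count in res[0] can also be cross-checked against the helper's error list (a sketch; with raise_on_error=False the bulk helper returns the errors instead of raising on the first failed action):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

def import_batch_checked(batch):
    es = Elasticsearch(f"{ES_HOST}:{ES_PORT}", retry_on_timeout=True,
                       max_retries=5, timeout=600)
    # With raise_on_error=False, bulk() returns (success_count, errors)
    # instead of raising BulkIndexError on the first failed action
    success, errors = bulk(es, batch, refresh='wait_for', raise_on_error=False)
    if errors:
        LOG.error('Bulk errors', errors=errors)
    return success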
At the end, the total number of documents indexed is 1.5M, which is expected. However, when I do a count on my index, I get ~1.45M. I ran the process several times; the number of lost documents is random but always around 40-50K.
The query, just in case: GET http://{{ host }}:{{ port }}/my_index/_count
"count": 1454260,
"_shards": {
"total": 9,
"successful": 9,
"skipped": 0,
"failed": 0
}
}
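(For completeness, the same count via the Python client, a sketch using the same client setup as above:)

es = Elasticsearch(f"{ES_HOST}:{ES_PORT}")
print(es.count(index='my_index')['count'])  # ~1.45M instead of the expected 1.5M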
I also tried without the multiprocessing library; fewer documents were lost, but still around 5-10K.
Any idea why the sum of successfully indexed documents and the count of documents currently in the index don't match?
Thanks.