Elasticsearch misses documents when querying by a list of ids as the list size increases

Copying this over from Stack Overflow.

I am querying Elasticsearch with a long list of ids. I break the list into chunks, run the query on each chunk, and aggregate the results afterwards. Elasticsearch seems to miss more documents as I increase the chunk size: with a window of 1000 ids it misses 49 documents, and with a window of 2000 ids it misses 93. Note that I am not changing the list, only the window size and therefore the number of queries needed to traverse the entire list.

When I collect the ids that returned nothing and run the same query on just that list, I do get results back. I also know these ids exist in Elasticsearch, because I get every result with a window size of 1, i.e. when I hit Elasticsearch once per id. Has anyone faced an issue like this? Any idea what could cause it?

I have also tried the scan helper (elasticsearch.helpers.scan), with similar results.

The documents in ES in my case have a nested field called passages, which in turn consists of other fields, e.g. id and other_field.

My code:

import os

from elasticsearch import Elasticsearch
from tqdm import tqdm

# passage_ids is the full list of passage ids to look up (built elsewhere).
es = Elasticsearch([os.getenv("ES_HOST")],
                   http_auth=(os.getenv("ES_USERNAME"), os.getenv("ES_PASSWORD")),
                   port=os.getenv("ES_PORT"))
results_dict = {}
window = 1  # number of ids sent per query
with tqdm(total=len(passage_ids), desc="Processing ES passage results...") as pbar:
    for start in range(0, len(passage_ids), window):
        # Slicing clamps at the end of the list, so the last chunk may be shorter.
        subset_passage_ids = passage_ids[start:start + window]
        query_dict = {
            "_source": [
                "_id",
                "court.id",
                "court.level",
                "court.federal"
            ],
            "query": {
                "nested": {
                    "path": "passages",
                    "query": {
                        "terms": {
                            "passages.id": subset_passage_ids
                        }
                    },
                    "inner_hits": {
                        "_source": [
                            "passages.id",
                            "passages.body"
                        ]
                    }
                }
            },
            "size": window
        }
        res = es.search(index='INDEX_NAME', body=query_dict)
        print("Got %d Hits:" % res['hits']['total']['value'])
        # res = elasticsearch.helpers.scan(es,
        #                                  index='INDEX_NAME',
        #                                  query=query_dict,
        #                                  preserve_order=True
        #                                  )
        for q in res['hits']['hits']:
            pass  # <DO SOME PROCESSING>

        pbar.update(len(subset_passage_ids))
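
To pin down which ids go missing in a given window, a check along the following lines could be run on each response. This is only a sketch: the function names are hypothetical, and it assumes the passage id appears in each inner hit's _source either as "id" or under "passages". It also flags parent documents whose inner hits look truncated (reported total larger than the number returned), since inner_hits has its own size setting (default 3) that is independent of the top-level "size".

```python
from typing import Any, Dict, List, Optional, Set


def _inner(hit: Dict[str, Any], path: str) -> Dict[str, Any]:
    """Return the inner-hits block for `path` of one top-level hit (or {})."""
    return hit.get("inner_hits", {}).get(path, {}).get("hits", {})


def _passage_id(src: Dict[str, Any], path: str) -> Optional[Any]:
    # Assumption: the id is stored either relative to the nested object
    # ("id") or under the nested path ({"passages": {"id": ...}}).
    if "id" in src:
        return src["id"]
    nested = src.get(path)
    return nested.get("id") if isinstance(nested, dict) else None


def returned_passage_ids(res: Dict[str, Any], path: str = "passages") -> Set[Any]:
    """Collect every passage id present in the inner hits of a response."""
    found = set()
    for hit in res["hits"]["hits"]:
        for inner_hit in _inner(hit, path).get("hits", []):
            pid = _passage_id(inner_hit.get("_source", {}), path)
            if pid is not None:
                found.add(pid)
    return found


def missing_ids(requested: List[Any], res: Dict[str, Any],
                path: str = "passages") -> List[Any]:
    """Ids that were requested but did not come back, in request order."""
    found = returned_passage_ids(res, path)
    return [pid for pid in requested if pid not in found]


def truncated_parents(res: Dict[str, Any], path: str = "passages") -> List[Any]:
    """Parent _ids whose inner hits report more matches than were returned."""
    out = []
    for hit in res["hits"]["hits"]:
        inner = _inner(hit, path)
        total = inner.get("total", 0)
        # ES 7 reports total as {"value": n, ...}; older versions use a bare int.
        n = total.get("value", 0) if isinstance(total, dict) else total
        if n > len(inner.get("hits", [])):
            out.append(hit.get("_id"))
    return out
```

Calling missing_ids(subset_passage_ids, res) and truncated_parents(res) right after es.search(...) would show, per window, which ids were dropped and whether they belong to parents with more matching passages than were returned.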

I can't think of any reason why I don't get all the results in one go.
Below are some stats from varying window over a passage_ids list of 10262 ids, each measured after consuming the entire list.

w: window size (ids per query)
d: documents not retrieved (10262 minus the number retrieved)

w = 1
d = (10262-10262) = 0
d/w = 0

w = 2
d = (10262-10262) = 0
d/w = 0

w = 150
d = (10262-10252) = 10
d/w = 0.07

w = 500
d = (10262-10241) = 21
d/w = 0.04

w = 1000
d = (10262-10213) = 49
d/w = 0.05

w = 2000
d = (10262-10169) = 93
d/w = 0.05

w = 3000
d = (10262-10124) = 138
d/w = 0.05

w = 5000
d = (10262-10041) = 221
d/w = 0.04
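
For reference, the figures above can be recomputed with a short script such as the one below. It is a sketch only: the names (TOTAL_IDS, RETRIEVED, window_stats) are made up for illustration, and the retrieved counts are copied from the list above.

```python
from typing import Dict, List, Tuple

TOTAL_IDS = 10262  # size of the passage_ids list

# window size -> number of documents retrieved over the whole list
RETRIEVED: Dict[int, int] = {
    1: 10262,
    2: 10262,
    150: 10252,
    500: 10241,
    1000: 10213,
    2000: 10169,
    3000: 10124,
    5000: 10041,
}


def window_stats(total: int, retrieved: Dict[int, int]) -> List[Tuple[int, int, float]]:
    """Return (w, d, d/w) per window size, with d/w rounded to 2 decimals."""
    rows = []
    for w in sorted(retrieved):
        d = total - retrieved[w]  # documents not retrieved
        rows.append((w, d, round(d / w, 2)))
    return rows


if __name__ == "__main__":
    for w, d, ratio in window_stats(TOTAL_IDS, RETRIEVED):
        print(f"w = {w:>4}  d = {d:>3}  d/w = {ratio:.2f}")
```

Beyond very small windows, d grows roughly in proportion to w, with d/w staying between about 0.04 and 0.07.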
