ElasticSearch missing documents by list of id's as size of list increases

Sakib_Rahman · August 1, 2019, 5:54pm

Copying over from SO.

I am querying Elasticsearch with a long list of ids. I break the list into chunks, run the query on each chunk, and aggregate the results afterwards. ES seems to miss documents as I increase the chunk size: with a window of 1000 ids it misses 49 documents, and with a window of 2000 ids it misses 93. Note that I am not changing the list itself, only the window size and hence the number of queries needed to traverse the whole list. When I manually build a list of the ids that returned nothing and run the same query on that list, I get results back. I know these ids exist in ES because I get all results with a window size of 1, i.e. when I hit ES once per id. Has anyone faced an issue like this? Any idea what could cause it?
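
To make the chunk-and-aggregate approach concrete, here is a minimal sketch of the bookkeeping only, with no Elasticsearch call. The names `chunked` and `missing_ids` and the sample ids are hypothetical; the `returned` set stands in for whatever ids were collected from the responses.

```python
from typing import Iterator, List, Sequence, Set


def chunked(ids: Sequence[str], window: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each at most `window` long."""
    if window < 1:
        raise ValueError("window must be at least 1")
    for start in range(0, len(ids), window):
        yield list(ids[start:start + window])


def missing_ids(requested: Sequence[str], returned: Set[str]) -> List[str]:
    """Return the requested ids that never came back, in request order."""
    return [pid for pid in requested if pid not in returned]


# Hypothetical ids, for illustration only.
passage_ids = [f"p{i}" for i in range(7)]
chunks = list(chunked(passage_ids, 3))

# Pretend the aggregated responses contained only these ids.
returned = {"p0", "p1", "p2", "p4", "p6"}
print(missing_ids(passage_ids, returned))
```

Comparing the requested ids against the ids actually seen, rather than comparing counts, shows exactly which passages drop out for a given window.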

I have also tried the scan helper (elasticsearch.helpers.scan), with similar results.

In my case, the documents in ES have a nested field called passages, which in turn contains other fields, e.g. id and other_field.
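
As an illustration of that shape: a nested query returns the parent documents as hits, so several requested passage ids can collapse into a single top-level hit. The sketch below mimics this in plain Python. The documents, ids, and the `parent_hits` helper are all hypothetical, and nothing here talks to Elasticsearch.

```python
from typing import Dict, Iterable, List

# Hypothetical documents: each parent holds several nested passages.
docs = [
    {"_id": "doc-1", "passages": [{"id": "a", "body": "first passage"},
                                  {"id": "b", "body": "second passage"}]},
    {"_id": "doc-2", "passages": [{"id": "c", "body": "third passage"}]},
]


def parent_hits(documents: Iterable[dict], wanted: Iterable[str]) -> Dict[str, List[str]]:
    """Map each parent _id to the wanted passage ids it contains."""
    wanted_set = set(wanted)
    hits: Dict[str, List[str]] = {}
    for doc in documents:
        matched = [p["id"] for p in doc["passages"] if p["id"] in wanted_set]
        if matched:
            hits[doc["_id"]] = matched
    return hits


hits = parent_hits(docs, ["a", "b", "c"])
print(len(hits), hits)
```

Three passage ids are requested here, but only two parents match, so the top-level hit count is not expected to equal the number of ids in a chunk.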

My code:

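import os

from elasticsearch import Elasticsearch
from tqdm import tqdm

# Connection settings are read from environment variables.
es = Elasticsearch([os.getenv("ES_HOST")],
                   http_auth=(os.getenv("ES_USERNAME"), os.getenv("ES_PASSWORD")),
                   port=os.getenv("ES_PORT"))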
results_dict = {}
window = 1
start = 0
end = 0
with tqdm(total=len(passage_ids), desc="Processing ES passage results...") as pbar:
    for _ in range(start, len(passage_ids), window):
        end = min(start + window, len(passage_ids))
        subset_passage_ids = passage_ids[start:end]
        query_dict = {
                    "_source": [
                        "_id",
                        "court.id",
                        "court.level",
                        "court.federal"
                    ],
                    "query": {
                        "nested": {
                            "path": "passages",
                            "query": {
                                "terms": {
                                    "passages.id": subset_passage_ids
                                }
                            },
                            "inner_hits": {
                                "_source": [
                                    "passages.id",
                                    "passages.body"
                                ]
                            }
                        }
                    },
                    "size": window
                }
        res = es.search(index='INDEX_NAME', body=query_dict)
        print("Got %d Hits:" % res['hits']['total']['value'])
        # res = elasticsearch.helpers.scan(es,
        #                                  index='INDEX_NAME',
        #                                  query=query_dict,
        #                                  preserve_order=True
        #                                  )
        for q in res['hits']['hits']:
            pass  # <DO SOME PROCESSING>

        pbar.update(end - start)
        start = end

I can't think of any reason why I don't get all the results in one go.
Here are some stats from varying window over a passage_ids list of 10262 ids. Each run consumed the entire list.

w: window_size
d: documents not retrieved

w = 1
d = (10262-10262) = 0
d/w = 0

w = 2
d = (10262-10262) = 0
d/w = 0

w = 150
d = (10262-10252) = 10
d/w = 0.07

w = 500
d = (10262-10241) = 21
d/w = 0.04

w = 1000
d = (10262-10213) = 49
d/w = 0.05

w = 2000
d = (10262-10169) = 93
d/w = 0.05

w = 3000
d = (10262-10124) = 138
d/w = 0.05

w = 5000
d = (10262-10041) = 221
d/w = 0.04
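
One hypothesis that would fit these numbers, offered as an assumption and not confirmed anywhere in this thread: inner_hits returns only the top three matching nested objects per parent document unless its size is set. With a window of 1 or 2 that cap can never apply, while larger windows make it more likely that more than three requested passages share a parent. The sketch below simulates such a cap on made-up data and builds a query body with an explicit inner_hits size. `count_missed`, `build_query`, and the id-to-parent mapping are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence

DEFAULT_INNER_HITS_SIZE = 3  # default number of inner hits returned per parent


def count_missed(ids: Sequence[str], parent_of: Dict[str, str],
                 window: int, cap: int = DEFAULT_INNER_HITS_SIZE) -> int:
    """Count passages dropped when each parent returns at most `cap` inner hits."""
    missed = 0
    for start in range(0, len(ids), window):
        per_parent = Counter(parent_of[pid] for pid in ids[start:start + window])
        missed += sum(max(0, n - cap) for n in per_parent.values())
    return missed


def build_query(ids: Sequence[str], inner_hits_size: int) -> dict:
    """Same nested terms query as in the post, with an explicit inner_hits size."""
    return {
        "query": {
            "nested": {
                "path": "passages",
                "query": {"terms": {"passages.id": list(ids)}},
                "inner_hits": {
                    "size": inner_hits_size,
                    "_source": ["passages.id", "passages.body"],
                },
            }
        },
        "size": len(ids),
    }


# Made-up data: 20 passages, five per parent document.
ids = [f"p{i}" for i in range(20)]
parent_of = {pid: f"doc-{i // 5}" for i, pid in enumerate(ids)}
for window in (1, 2, 5, 20):
    print(window, count_missed(ids, parent_of, window))
```

If this is the cause, passing something like `build_query(subset_passage_ids, len(subset_passage_ids))` as the search body would be the change to try. To my knowledge the inner_hits size is bounded by the index setting index.max_inner_result_window (default 100), so very large windows may need that setting raised or a smaller window.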

system · August 29, 2019, 5:54pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
