Copying over from SO.
I am querying Elasticsearch with a long list of IDs. I usually break this list into chunks, run the query on each subset, and aggregate the results afterwards. ES seems to miss documents as I increase the chunk (window) size: for a window of 1000 IDs it misses 49 documents, and for a window of 2000 IDs it misses 93. Note that I am not changing the list, only the window size and hence the number of queries needed to traverse the entire list. When I manually build a list of the IDs that did not return anything and run the same query with that list, I do get results back. I know these IDs actually exist in ES because I get all results with a window size of 1, i.e. when I hit ES once per ID. Has anyone faced an issue like this? Any idea what could cause it?
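For reference, the list of IDs that "did not return anything" can be built with a plain set difference. This is only a minimal sketch: `find_missing_ids` is a hypothetical helper name, and the IDs in the usage lines are made up for illustration.

```python
from typing import Iterable, List, Set


def find_missing_ids(requested: Iterable[str], returned: Iterable[str]) -> List[str]:
    """Return the requested IDs that never came back, in request order, without duplicates."""
    returned_set: Set[str] = set(returned)
    seen: Set[str] = set()
    missing: List[str] = []
    for pid in requested:
        if pid not in returned_set and pid not in seen:
            missing.append(pid)
            seen.add(pid)
    return missing


# Made-up IDs, for illustration only
requested_ids = ["p1", "p2", "p3", "p4"]
returned_ids = ["p1", "p3"]
print(find_missing_ids(requested_ids, returned_ids))
```

The resulting list can then be fed back into the same query to check whether those IDs are retrievable on their own.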
I have also tried using the scan method (elasticsearch.helpers.scan), with similar results.
The objects in ES in my case have a nested field called passages, which in turn consists of other fields, e.g. id and other_field.
My code:
import os

from elasticsearch import Elasticsearch
from tqdm import tqdm

# passage_ids: the full list of passage IDs to look up (defined elsewhere)
es = Elasticsearch([os.getenv("ES_HOST")],
                   http_auth=(os.getenv("ES_USERNAME"), os.getenv("ES_PASSWORD")),
                   port=os.getenv("ES_PORT"))
results_dict = {}
window = 1  # number of passage IDs sent per query
start = 0
end = 0
with tqdm(total=len(passage_ids), desc="Processing ES passage results...") as pbar:
    for _ in range(start, len(passage_ids), window):
        # Clamp the last chunk to the end of the list
        end = start + window if start + window < len(passage_ids) else len(passage_ids)
        subset_passage_ids = passage_ids[start:end]
        query_dict = {
            "_source": [
                "_id",
                "court.id",
                "court.level",
                "court.federal"
            ],
            "query": {
                "nested": {
                    "path": "passages",
                    "query": {
                        "terms": {
                            "passages.id": subset_passage_ids
                        }
                    },
                    "inner_hits": {
                        "_source": [
                            "passages.id",
                            "passages.body"
                        ]
                    }
                }
            },
            "size": window
        }
        res = es.search(index='INDEX_NAME', body=query_dict)
        print("Got %d Hits:" % res['hits']['total']['value'])
        # res = elasticsearch.helpers.scan(es,
        #                                  index='INDEX_NAME',
        #                                  query=query_dict,
        #                                  preserve_order=True
        #                                  )
        for q in res['hits']['hits']:
            pass  # <DO SOME PROCESSING>
        pbar.update(end - start)
        start = end
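One way to narrow this down, offered as a diagnostic sketch and not as a confirmed cause: for each parent hit, compare how many nested passages matched with how many were actually returned in the inner hits. The helper name `truncated_inner_hits` is hypothetical, and the code assumes the Elasticsearch 7 response shape in which each parent hit carries `inner_hits.passages.hits` with a `total.value` count and a `hits` list. The sample response is made up.

```python
from typing import Any, Dict, List, Tuple


def truncated_inner_hits(response: Dict[str, Any],
                         inner_name: str = "passages") -> List[Tuple[Any, int, int]]:
    """List (parent _id, matched, returned) for parents whose inner hits were cut short."""
    report = []
    for hit in response.get("hits", {}).get("hits", []):
        inner = hit.get("inner_hits", {}).get(inner_name, {}).get("hits", {})
        matched = inner.get("total", {}).get("value", 0)  # nested docs that matched
        returned = len(inner.get("hits", []))             # nested docs actually returned
        if returned < matched:
            report.append((hit.get("_id"), matched, returned))
    return report


# Made-up response with the assumed shape, for illustration only
sample_response = {
    "hits": {
        "total": {"value": 2, "relation": "eq"},
        "hits": [
            {"_id": "doc-a", "inner_hits": {"passages": {"hits": {
                "total": {"value": 2, "relation": "eq"},
                "hits": [{"_source": {"id": "p1"}}, {"_source": {"id": "p2"}}]}}}},
            {"_id": "doc-b", "inner_hits": {"passages": {"hits": {
                "total": {"value": 5, "relation": "eq"},
                "hits": [{"_source": {"id": "p3"}}, {"_source": {"id": "p4"}},
                         {"_source": {"id": "p5"}}]}}}},
        ],
    }
}
print(truncated_inner_hits(sample_response))
```

Calling this on each `res` inside the loop would show whether the shortfall comes from parents returning fewer passages than they matched, or from parents that are missing altogether.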
I can't think of any reason why I don't get all the results in one go.
Some stats I drew up while varying window, with a passage_ids list of size 10262, measured after consuming the entire passage_ids list.
w: window_size
d: documents not retrieved
w = 1
d = (10262-10262) = 0
d/w = 0
w = 2
d = (10262-10262) = 0
d/w = 0
w = 150
d = (10262-10252) = 10
d/w = 0.07
w = 500
d = (10262-10241) = 21
d/w = 0.04
w = 1000
d = (10262-10213) = 49
d/w = 0.05
w = 2000
d = (10262-10169) = 93
d/w = 0.05
w = 3000
d = (10262-10124) = 138
d/w = 0.05
w = 5000
d = (10262-10041) = 221
d/w = 0.04
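The figures above can be recomputed with a few lines of Python. This is a small sketch that only uses the counts already listed; `miss_stats` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple

TOTAL_IDS = 10262

# window size -> number of documents retrieved, taken from the stats above
retrieved_by_window: Dict[int, int] = {
    1: 10262,
    2: 10262,
    150: 10252,
    500: 10241,
    1000: 10213,
    2000: 10169,
    3000: 10124,
    5000: 10041,
}


def miss_stats(total: int, retrieved: Dict[int, int]) -> List[Tuple[int, int, float]]:
    """Return (w, d, d/w) rows, where d is the number of documents not retrieved."""
    rows = []
    for w, got in sorted(retrieved.items()):
        d = total - got
        rows.append((w, d, d / w))
    return rows


for w, d, ratio in miss_stats(TOTAL_IDS, retrieved_by_window):
    print(f"w={w:>5}  d={d:>4}  d/w={ratio:.4f}")
```

From w = 500 upwards the ratio d/w stays between roughly 0.04 and 0.05, so the number of missed documents grows about linearly with the window size.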