I have the following Python code to query my ES cluster (v8.8.0).
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

# es (the Elasticsearch client), index_name, start_datetime, end_datetime and
# process_result are defined elsewhere in my script.
for i in range(0, 100):
    s = Search(using=es, index=index_name) \
        .filter('range', **{'@timestamp': {'gte': start_datetime,
                                           'lt': end_datetime,
                                           'format': 'strict_date_optional_time_nanos'}})
    if s.count() > 0:
        results = s.scan()
        for result in results:
            result_json = result.to_dict()
            res1, res2 = process_result(result_json)
Each scan returns between 1 and 2 million documents.
During one of the queries, I got an error on ES:
circuit_breaking_exception: [parent] Data too large, data for [<http_request>] would be [32283148058/30gb], which is larger than the limit of [31621696716/29.4gb], real usage: [32283146480/30gb], new bytes reserved: [1578/1.5kb]...
It seems like ES ran out of memory when executing the query. Is there something I can specify in my query so as to avoid this issue?
I'm not sure if the scan I'm using is the same as what is described here for elasticsearch.helpers.scan. If so, is there some setting I can specify? For example, should I set scroll or request_timeout to make sure the cache is cleared after each call to scan, perhaps adding a sleep to make sure the timeout is reached?
Also, if clear_scroll=True by default, then I would not need to explicitly clear the scroll_id, is that correct?
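For reference, this is roughly what I imagine the equivalent call with elasticsearch.helpers.scan would look like, if my Search.scan() is indeed the same thing under the hood (the scroll, size and request_timeout values below are placeholders I made up, not settings I know to be correct):

from elasticsearch.helpers import scan

# Same es, index_name, start_datetime and end_datetime as in my snippet above.
query = {
    "query": {
        "range": {
            "@timestamp": {
                "gte": start_datetime,
                "lt": end_datetime,
                "format": "strict_date_optional_time_nanos",
            }
        }
    }
}

# scroll / size / request_timeout here are placeholder values, not recommendations.
for hit in scan(es,
                query=query,
                index=index_name,
                scroll="5m",          # how long ES keeps each scroll context alive
                size=1000,            # per-shard batch size for each scroll request
                request_timeout=60,   # client-side timeout for each request, in seconds
                clear_scroll=True):   # clear the scroll context when done (the documented default)
    res1, res2 = process_result(hit["_source"])

If Search.scan() really does delegate to this helper, I assume I could also pass these settings via .params() on the Search object, but I'd like to confirm that.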
I'm not a Python expert but I saw this in the docs:
size – size (per shard) of the batch send at each iteration.
I'm curious about this: over how many shards is the scan operation running? As I understand it, the size should not be more than 10,000 per shard.
Also, maybe your documents (the _source field) are very big? In that case, reducing the size could help.
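Something along these lines, reusing your snippet, is what I mean (a rough sketch: field_a/field_b and the size of 500 are placeholders, just to illustrate trimming _source and shrinking the per-shard batches):

# Fetch only the fields you actually need, in smaller per-shard batches.
# field_a and field_b are placeholders for your real field names.
s = (
    Search(using=es, index=index_name)
    .filter('range', **{'@timestamp': {'gte': start_datetime,
                                       'lt': end_datetime,
                                       'format': 'strict_date_optional_time_nanos'}})
    .source(['field_a', 'field_b'])   # limit which _source fields come back
    .params(size=500)                 # smaller batch per shard for each scroll request
)

for result in s.scan():
    process_result(result.to_dict())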
I've been thinking about the cause of the issue. I ran the loop a couple of times (with different post-processing of the results) before getting the error and had no problem. The size of the data mentioned in the error (32 GB) is also much larger than the size of the results from a single call to scan.
Someone else suggested it could be because the cache was not cleared, so ES eventually ran out of memory after so many calls, although it seems that scan has clear_scroll=True by default.
If it is somehow due to cumulative memory usage from multiple calls, how would reducing the size in scan help?