Error while using .scan() function call

Hi,

I have the following Python code to query my ES cluster (v8.8.0).

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

es = Elasticsearch(...)  # connection details omitted

for i in range(0, 100):
    s = Search(using=es, index=index_name) \
        .filter('range', **{'@timestamp': {'gte': start_datetime,
                                           'lt': end_datetime,
                                           'format': 'strict_date_optional_time_nanos'}})

    if s.count() > 0:
        results = s.scan()
        for result in results:
            result_json = result.to_dict()
            res1, res2 = process_result(result_json)

The number of documents returned by each scan is between 1 and 2 million.

During one of the queries, I got an error on ES:

circuit_breaking_exception: [parent] Data too large, data for [<http_request>] would be [32283148058/30gb], which is larger than the limit of [31621696716/29.4gb], real usage: [32283146480/30gb], new bytes reserved: [1578/1.5kb]...

It seems like ES ran out of memory when executing the query. Is there something I can specify in my query so as to avoid this issue?

I'm not sure if the scan I'm using is the same as what is mentioned here for elasticsearch.helpers.scan. If so, is there some setting I can specify? For example, should I set scroll or request_timeout to make sure the cache is cleared after each call to scan, maybe with a sleep added to make sure the timeout is met?

Also, if clear_scroll=True by default, then I would not need to explicitly clear the scroll_id, is that correct?
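
For reference, this is roughly how I picture those settings being passed if I were to call the helper directly; the keep-alive and timeout values below are only placeholders, not what I currently use:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch(...)  # same connection as above

# index_name, start_datetime, end_datetime and process_result are the same
# variables as in my snippet above; scroll, request_timeout and clear_scroll
# are the settings I'm asking about.
results = scan(
    es,
    index=index_name,
    query={"query": {"range": {"@timestamp": {"gte": start_datetime,
                                              "lt": end_datetime}}}},
    scroll="2m",          # keep-alive of each scroll context
    request_timeout=60,   # per-request timeout, in seconds
    clear_scroll=True,    # the default: clear the scroll id when the generator is exhausted
)
for hit in results:
    process_result(hit["_source"])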

Thank you.

I'm not a Python expert but I saw this in the docs:

  • size – size (per shard) of the batch send at each iteration.

I'm curious about this. How many shards is the scan operation running on? In my opinion, the size should not be more than 10000 per shard.
Also, maybe your documents (the _source field) are super big? In that case, reducing the size could help.

For the query that failed, the error message was "Scroll request has only succeeded on 11 shards out of 12".

Each document should be under 1 KB. Is that considered big? I don't quite understand what this parameter means.

size is the number of documents you are asking for each scroll request. Try to reduce it and see if it helps?


What @dadoonet said is correct. To reduce the size of the scan using the Elasticsearch DSL client, you need to use the params() function like this:

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

es = Elasticsearch(...)

s = Search(using=es, index="...")
for result in s.params(size=100).scan():
    ...

I also want to mention that you might want to look at point-in-time search (Point in time API | Elasticsearch Guide [8.10] | Elastic) which is more robust than scrolling.
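
A rough sketch of what that could look like with the plain Python client, just to give an idea (the index name, query, and page size are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch(...)

# Open a point in time on the index; keep_alive is refreshed on every search.
pit = es.open_point_in_time(index="my-index", keep_alive="1m")
pit_id = pit["id"]

search_after = None
try:
    while True:
        resp = es.search(
            size=1000,
            pit={"id": pit_id, "keep_alive": "1m"},
            sort=[{"@timestamp": "asc"}],  # PIT searches add an implicit _shard_doc tiebreaker
            search_after=search_after,
            query={"match_all": {}},       # your range filter would go here
        )
        hits = resp["hits"]["hits"]
        if not hits:
            break
        for hit in hits:
            ...  # process hit["_source"]
        search_after = hits[-1]["sort"]
        pit_id = resp["pit_id"]            # the PIT id can change between requests
finally:
    es.close_point_in_time(id=pit_id)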


Thanks, @dadoonet and @Quentin_Pradet. I'll try reducing the size to see if it works, and also look at the Point in time API.

I've been thinking about the cause of the issue. I had run the loop a couple of times (with different post-processing of the results) before getting the error, without any problem. The amount of data mentioned in the error (32 GB) is also much larger than the size of the results from a single call to scan.

Someone else suggested that it could be because the cache was not cleared, so ES eventually ran out of memory after so many calls, although it seems like scan has clear_scroll=True by default.

If it is somehow due to cumulative memory usage from multiple calls, how would reducing the size in scan help?
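
For what it's worth, if params() does forward its keyword arguments to elasticsearch.helpers.scan, I assume I could also set the scroll behaviour explicitly when I reduce the size, something like this (values made up):

# s is the Search object from my first snippet.
results = s.params(size=500, scroll="2m", clear_scroll=True).scan()
for result in results:
    result_json = result.to_dict()
    res1, res2 = process_result(result_json)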
