I have the following Python code to query my ES cluster (v8.8.0).
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

# es (the Elasticsearch client), index_name, start_datetime, end_datetime and
# process_result are defined elsewhere in my script.
for i in range(0, 100):
    s = Search(using=es, index=index_name) \
        .filter('range', **{'@timestamp': {'gte': start_datetime,
                                           'lt': end_datetime,
                                           'format': 'strict_date_optional_time_nanos'}})
    if s.count() > 0:
        results = s.scan()
        for result in results:
            result_json = result.to_dict()
            res1, res2 = process_result(result_json)
Each scan returns between 1 and 2 million documents.
During one of the queries, I got an error on ES:
circuit_breaking_exception: [parent] Data too large, data for [<http_request>] would be [32283148058/30gb], which is larger than the limit of [31621696716/29.4gb], real usage: [32283146480/30gb], new bytes reserved: [1578/1.5kb]...
It seems like ES ran out of memory when executing the query. Is there something I can specify in my query so as to avoid this issue?
I'm not sure if the scan I'm using is the same as what is described here for elasticsearch.helpers.scan. If so, is there some setting I can specify? For example, should I set scroll or request_timeout to make sure the cache is cleared after each call to scan, perhaps adding a sleep to make sure the timeout is reached?
Also, if clear_scroll=True by default, then I would not need to explicitly clear the scroll_id, is that correct?
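For reference, this is roughly what I imagine the equivalent call with elasticsearch.helpers.scan would look like, if my Search.scan() is indeed the same thing under the hood (the scroll, size and request_timeout values below are placeholders I made up, not settings I know to be correct):

from elasticsearch.helpers import scan

# Same es, index_name, start_datetime and end_datetime as in my snippet above.
query = {
    "query": {
        "range": {
            "@timestamp": {
                "gte": start_datetime,
                "lt": end_datetime,
                "format": "strict_date_optional_time_nanos",
            }
        }
    }
}

# scroll / size / request_timeout here are placeholder values, not recommendations.
for hit in scan(es,
                query=query,
                index=index_name,
                scroll="5m",          # how long ES keeps each scroll context alive
                size=1000,            # per-shard batch size for each scroll request
                request_timeout=60,   # client-side timeout for each request, in seconds
                clear_scroll=True):   # clear the scroll context when done (the documented default)
    res1, res2 = process_result(hit["_source"])

If Search.scan() really does delegate to this helper, I assume I could also pass these settings via .params() on the Search object, but I'd like to confirm that.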
I'm not a Python expert but I saw this in the docs:
size – size (per shard) of the batch send at each iteration.
I'm curious about this: over how many shards is the scan operation running? As I understand it, the size should not be more than 10,000 per shard.
Also, maybe your documents (the _source field) are very big? In that case, reducing the size could help.
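Something along these lines, reusing your snippet, is what I mean (a rough sketch: field_a/field_b and the size of 500 are placeholders, just to illustrate trimming _source and shrinking the per-shard batches):

# Fetch only the fields you actually need, in smaller per-shard batches.
# field_a and field_b are placeholders for your real field names.
s = (
    Search(using=es, index=index_name)
    .filter('range', **{'@timestamp': {'gte': start_datetime,
                                       'lt': end_datetime,
                                       'format': 'strict_date_optional_time_nanos'}})
    .source(['field_a', 'field_b'])   # limit which _source fields come back
    .params(size=500)                 # smaller batch per shard for each scroll request
)

for result in s.scan():
    process_result(result.to_dict())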
I've been thinking about the cause of the issue. I ran the loop a couple of times (with different post-processing of the results) before getting the error and had no problem. The size of the data mentioned in the error (32 GB) is also much larger than the size of the results from a single call to scan.
Someone else suggested it could be because the cache was not cleared, so ES eventually ran out of memory after so many calls, although it seems that scan has clear_scroll=True by default.
If it is somehow due to cumulative memory usage from multiple calls, how would reducing the size in scan help?