We have a .NET app that uses NEST to store events in Elasticsearch. Randomly, after some time, my three-node Elasticsearch cluster stops working and I see these errors:
[indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.ElasticsearchException: Trying to create too many scroll contexts. Must be less than or equal to: [500]. This limit can be set by changing the [search.max_open_scroll_context] setting.
When this error happens, the cluster's memory and CPU usage increase roughly fourfold.
java.base/java.lang.Thread.run(Thread.java:1623)
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<reduce_aggs>] would be [10211962019/9.5gb], which is larger than the limit of [10200547328/9.5gb], real usage: [10211961944/9.5gb], new bytes reserved: [75/75b], usages [inflight_requests=10553946/10mb, request=6373/6.2kb, fielddata=4464632454/4.1gb, eql_sequence=0/0b, model_inference=0/0b]
Are these two errors related? Why is the cluster hitting the context limit without freeing contexts, so that everything blocks and times out? How can we avoid these issues? What scroll size is recommended (100, 1000, or 10000) to minimize this?
Hello, and sorry for not answering earlier. This is not an issue with the NEST client - having that many open scroll contexts is indeed going to hurt Elasticsearch.
Are these two errors related? Why is the cluster hitting the context limit without freeing contexts, so that everything blocks and times out?
The scroll search context is only freed after the scroll timeout has elapsed (or when it is cleared explicitly); see Paginate search results | Elasticsearch Guide [8.15] | Elastic. This timeout is often set to 1m (one minute). If too many scroll contexts are kept open at the same time, your cluster can suffer.
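One way to keep contexts from piling up is to release each one as soon as you are done with it, instead of waiting for the timeout. A sketch using the clear scroll API (the scroll ID placeholder below stands in for the ID returned by your last scroll response):

```
DELETE /_search/scroll
{
  "scroll_id": "<scroll-id-from-the-last-response>"
}
```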
While Elasticsearch will be able to reject new scroll contexts (first error), the memory pressure means that other requests can get rejected (second error). In that sense, the errors are related.
How can we avoid these issues? What scroll size is recommended (100, 1000, or 10000) to minimize this?
You could look into why your cluster is opening 500 scroll contexts. That's a lot: do you know why you need that many, and why they all seem to be open at the same time?
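To see how many scroll contexts are currently open on each node, you can check the node stats (the `open_contexts` counter appears under each node's `indices.search` section):

```
GET /_nodes/stats/indices/search
```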
You could try to reduce the search.max_open_scroll_context limit if you'd like less impact on other requests.
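Since `search.max_open_scroll_context` is a dynamic cluster setting, it can be changed without a restart. A sketch lowering it (the value 200 is just an example; pick one that fits your workload):

```
PUT /_cluster/settings
{
  "persistent": {
    "search.max_open_scroll_context": 200
  }
}
```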
Regarding the scroll size, a larger size could mean you get through the results faster, and thus you keep the scroll context alive for less time. But larger batches take longer to process per page, which could require a longer scroll timeout.
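Putting this together on the NEST side, a minimal scroll loop might look like the sketch below. This is illustrative only: the `Event` type, index name, `Process` method, and the chosen size and keep-alive are assumptions, not your actual code.

```csharp
// Sketch of a NEST scroll loop (Event, "events", and Process are hypothetical).
// The keep-alive ("1m") only needs to cover the processing of one batch,
// and clearing the scroll at the end frees the context immediately
// instead of waiting for the timeout to expire.
var client = new ElasticClient();

var response = client.Search<Event>(s => s
    .Index("events")
    .Scroll("1m")   // keep-alive per batch, not for the whole scan
    .Size(1000));   // batch size: tune this, e.g. between 100 and 10000

while (response.Documents.Any())
{
    Process(response.Documents);  // your handling logic
    response = client.Scroll<Event>("1m", response.ScrollId);
}

client.ClearScroll(c => c.ScrollId(response.ScrollId));
```

A larger `Size` means fewer round trips and a shorter-lived context, at the cost of more memory per batch; the explicit `ClearScroll` is what keeps finished scrolls from counting against the 500-context limit.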