Nodes are not responsive after a large query

Eldad_Moneta · June 29, 2017, 7:40am

Hi,

We have a hot-warm cluster for time-based metrics. We have 3 hot nodes with 30.5GB RAM and 900GB SSDs, 3 warm nodes with 30.5GB RAM and large network storage, and 3 dedicated, weaker master machines.

Some of our queries contain a lot of aggregations, and when we send such a query, the RAM usage of the machine that receives the request spikes and then we see a lot of:
[WARN ][monitor.jvm ] [warn_1] [gc][old][591640][101] duration [28.9s]

At this point, the machine stops responding to ping/query requests and we can't even restart the service. I have to kill the process (sudo kill -9) and start it again.

I tried to analyze the queries but ES 2.4.5 doesn't support analyzing the aggregations.
Any idea how we can:

1. Analyze our aggregations so we can improve the search times and memory usage?
2. Prevent the machine from dropping out of the cluster? Perhaps some sort of circuit breaker?

Christian_Dahlqvist · June 29, 2017, 7:49am

Are all nodes having issues or just one of the zones? How much data do you have on the nodes? How many indices/shards? How much of this are you aggregating across when you see issues? What type of aggregations are you running?

Eldad_Moneta · June 29, 2017, 8:14am

It happens to a single node - the node that gets the search request.
The warm nodes have ~900GB of data => 3B documents. We have ~20 index types X index per day.
The query is performed on a single index type, usually across 14 days => 14 indices where a single day index has ~2GB with 5M documents, 3 shards per day index.

The aggregations we use are: terms, histogram, sum and max.
We use "order_by" in some of the aggregations as well as a breadth_first collection mode. I know this can take a toll on memory usage. We added it because it gave us speedier queries back in the day, but I'd rather have a slow query than a failing machine.

Mark_Harwood · June 29, 2017, 8:35am

When you say count do you mean value count or cardinality ?

Eldad_Moneta · June 29, 2017, 8:41am

I'm sorry, I made a mistake - 'count' is just a name of an aggregation, we don't actually use a 'count' aggregation. I'll edit my original answer.

Mark_Harwood · June 29, 2017, 10:21am

Aggregations are capable of creating combinatorial explosions of values during execution if you have many nested aggs and many unique values.
Using breadth_first collection mode can help, but depending on your choice of order_by it may actually be ignored: the top-level agg asked to do breadth_first may conclude that it needs to do depth_first execution if you are ordering by a calculation performed in a child agg.
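
To make the ordering point concrete, here is a minimal sketch of two terms aggregation bodies built as Python dicts. The field names (host, cpu), the aggregation names (by_host, max_cpu) and the helper functions are hypothetical and not taken from the thread. The helper only flags the pattern described above (ordering a terms agg by one of its child aggs); it does not reproduce what Elasticsearch decides internally.

```python
import re


def terms_with_child_max(order):
    """Build a terms agg (hypothetical field 'host') with one child max agg."""
    return {
        "by_host": {
            "terms": {
                "field": "host",
                "collect_mode": "breadth_first",
                "order": order,
            },
            "aggs": {"max_cpu": {"max": {"field": "cpu"}}},
        }
    }


def orders_by_child(aggs):
    """Return True if any terms agg in `aggs` is ordered by one of its child aggs.

    Assumes `order` is given as a single dict, e.g. {"max_cpu": "desc"}.
    """
    for body in aggs.values():
        order = body.get("terms", {}).get("order", {})
        children = body.get("aggs", {})
        for key in order:
            # Order paths may look like "child", "child.metric" or "a>b.metric";
            # only the first path element names a direct child agg.
            first = re.split(r"[.>]", key)[0]
            if first in children:
                return True
    return False


# Ordered by the child max agg: the case where breadth_first may be ignored.
ordered_by_child = terms_with_child_max({"max_cpu": "desc"})
# Ordered by document count: no child calculation is needed for ordering.
ordered_by_count = terms_with_child_max({"_count": "desc"})

print(orders_by_child(ordered_by_child))
print(orders_by_child(ordered_by_count))
```

A check like this could be run over existing query templates to find aggregations worth rewriting first.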

Another approach when you have too many combinations is to use multiple requests to explore less stuff in each request. The terms aggregation in 5.2+ has a partitioning feature that can support this technique or you can potentially do the same if you use a query that filters doc selections into non-overlapping sets per request.
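
Both ways of splitting the work can be sketched as request-body generators. This is only an illustration under assumptions: the field name host, the timestamp field @timestamp, the aggregation name by_term and both function names are hypothetical. The first generator uses the include/partition syntax of the terms aggregation, which needs 5.2 or later; the second uses non-overlapping date range filters and should also suit older versions.

```python
from datetime import date, timedelta


def partitioned_requests(field, num_partitions, size=1000):
    """Yield one search body per terms partition (requires ES 5.2+)."""
    for partition in range(num_partitions):
        yield {
            "size": 0,  # aggregations only, no hits
            "aggs": {
                "by_term": {
                    "terms": {
                        "field": field,
                        "include": {
                            "partition": partition,
                            "num_partitions": num_partitions,
                        },
                        "size": size,
                    }
                }
            },
        }


def daily_requests(first_day, num_days, aggs, ts_field="@timestamp"):
    """Yield one search body per day, each filtered to a non-overlapping range."""
    for offset in range(num_days):
        day = first_day + timedelta(days=offset)
        next_day = day + timedelta(days=1)
        yield {
            "size": 0,
            "query": {
                "bool": {
                    "filter": {
                        "range": {
                            ts_field: {
                                "gte": day.isoformat(),
                                "lt": next_day.isoformat(),  # exclusive upper bound
                            }
                        }
                    }
                }
            },
            "aggs": aggs,
        }


partition_bodies = list(partitioned_requests("host", 4))
day_bodies = list(daily_requests(date(2017, 6, 15), 14, aggs={}))
```

With the per-day variant the client has to merge the partial results itself: sums can be added and maxima compared, but a top-N terms list merged this way is only approximate.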

Eldad_Moneta · June 29, 2017, 11:12am

Thanks!
We actually had order_by based on a child agg. We'll try to remove this and see if this helps a bit.
As for splitting the requests - it's doable on our side (until we upgrade to ES 5 in the near future).

Any idea how we can prevent the nodes from failing?

Mark_Harwood · June 29, 2017, 11:19am

- Look at using a newer version (we continually aim to improve circuit breakers and reduce memory use).
- Break queries into multiple requests.
- Fail faster: try the timeout setting in the search request in the hope that rogue searches will be cut short before they cause GC issues.
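
The "fail faster" suggestion could look roughly like the sketch below. The helper names, the 10s value and the sample response are illustrative assumptions. The timeout key in the request body and the timed_out flag in the response are part of the search API.

```python
def with_timeout(body, timeout="10s"):
    """Return a copy of a search body with a search timeout set."""
    limited = dict(body)  # shallow copy so the caller's dict is not modified
    limited["timeout"] = timeout
    return limited


def is_partial(response):
    """Return True if the search response reports that it timed out."""
    return bool(response.get("timed_out", False))


# Hypothetical aggregation-only request.
search_body = {"size": 0, "aggs": {"max_cpu": {"max": {"field": "cpu"}}}}
limited_body = with_timeout(search_body, "10s")

# Hand-written sample response, not output from a real cluster.
sample_response = {"took": 10012, "timed_out": True, "aggregations": {}}
if is_partial(sample_response):
    print("search timed out; results may be partial")
```

A timed-out search still returns whatever was collected before the limit, so callers should check the flag before trusting the aggregation values.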

Eldad_Moneta · June 29, 2017, 12:09pm

Thank you, I will try to follow these tips and update on the results.

Eldad_Moneta · June 29, 2017, 12:15pm

It seems that the timeout parameter affects the query, but not the aggregations. Or perhaps it just doesn't affect the merge-sort part, which is probably the more difficult part.

Mark_Harwood · June 29, 2017, 2:02pm

timeout will curtail aggregation collection activity on shards. Aggs collect results as part of the timed search loop that matches docs. The final reduce phase, however, is not subject to further timer checks.

qwerty2002 · June 30, 2017, 6:48am

Thanks, can it work hard? Because it has a high RAM.

system · July 28, 2017, 6:48am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
