Nodes are not responsive after a large query


We have a hot-warm cluster for time-based metrics. We have 3 hot nodes with 30.5GB RAM and 900GB SSDs, 3 warm nodes with 30.5GB RAM and a large network storage and 3 dedicated weaker master machines.

Some of our queries contains a lot of aggregations, and when we send this query, the RAM usage of the machine that receives this request spikes and then we see a lot of
[WARN ][monitor.jvm ] [warn_1] [gc][old][591640][101] duration [28.9s]

At this point, the machine stops responding to ping\query request and we can't even restart the service. I have to kill the process (sudo kill -9) and start it again.

I tried to analyze the queries but ES 2.4.5 doesn't support analyzing the aggregations.
Any idea how can we:

  1. Analyze our aggregations so we can improve the search times and memory usage?
  2. Prevent the machine from dropping out of the cluster? perhaps some sort of a circuit breaker?

Are all nodes having issues or just one of the zones? How much data do you have on the nodes? How many indices/shards? How much of this are you aggregating across when you see issues? What type of aggregations are you running?

It happens to a single node - the node that gets the search request.
The warm nodes have ~900GB of data => 3B documents. We have ~20 index types X index per day.
The query is performed on a single index type, usually across 14 days => 14 indices where a single day index has ~2GB with 5M documents, 3 shards per day index.

The aggregations we use are: terms, histogram, sum and max.
We use "order_by" in some of the aggregations as well as a breadth_first collection mode. I know this can take a toll on memory usage - we added it because it gave us speedier queries back in the day, but I rather have a slow query instead of a failing machine :slight_smile:

When you say count do you mean value count or cardinality ?

I'm sorry, I made a mistake - 'count' is just a name of an aggregation, we don't actually use a 'count' aggregation. I'll edit my original answer.

Aggregations are capable of creating combinatorial explosions of values during execution if you have many nested aggs and many unique values.
Using breadth_first collection mode can help but depending on your choice of order_by may actually be ignored (the top-level agg asked to do breadth_first may conclude that it needs to do depth_first execution if you are ordering by a calculation performed in a child agg).

Another approach when you have too many combinations is to use multiple requests to explore less stuff in each request. The terms aggregation in 5.2+ has a partitioning feature that can support this technique or you can potentially do the same if you use a query that filters doc selections into non-overlapping sets per request.

We actually had order_by based on a child agg. We'll try to remove this and see if this helps a bit.
As for splitting the requests - it's doable on our side (until we'll upgrade to ES 5 in the near future).

Any idea how can we prevent the failing nodes?

  • Look at using a newer version (we continually aim to improve circuit breakers and reduce memory use)
  • Break queries into multiple requests
  • Fail faster - try the timeout setting in the search request in the hope that rogue searches will be cut-short before they cause GC issues.

Thank you, I will try to follow these tips and update on the results.

It seems that the timeout paramter affects the query, but not the aggregations. Or perhaps it just doesn't affect the merge-sort part which is probably the more difficult part.

timeout will curtail aggregation collection activity on shards. Aggs collect results as part of the timed search loop that matches docs. The final reduce phase however is not subject to further timer checks.

Thanks, can it work hard? Because it has a high RAM.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.