Heap usage issue

Hi!

I've got an issue with one visualization. There's a high-cardinality field (HCF) I need to check once in a while. My visualization (a line chart) shows the top 10 instances by unique count. When I run it over the past hour, one of the nodes hits the heap memory limit (circuit breaker) and the shards on that node go unassigned.

None of the other nodes flinch at all. When I do the same on another cluster, the load seems to be distributed. The two clusters are running almost identical setups. The one with the issue described above is running in a newer environment (OpenJDK 11). I can't put my finger on the problem here.
Everything else runs smoothly: CPU utilization is between 20-30%, heap is around 50%, average load is around 1, and no other visualizations, dashboards, or complex aggregations have a similar effect on the cluster.
Is there a way to prevent this behavior (a 10 GB spike in heap)?

Specs:
ES version: 7.2.1
Cluster has 6 data nodes each with

  • 55GB RAM/25GB heap
  • 8 core (2.3 GHz)

Index in question (at query time):

  • 100M+ documents
  • 10 shards (5 primary, 5 replica)
  • size 73 GB+
  • unique instances ~20K
  • HCF 500K+ unique values

Thank you,
YvorL

What is the configuration of the dashboard that is experiencing these problems?

I don't even need a dashboard. The line visualization is set to show the unique count of the field with a date histogram (auto), split series by instance name. When I set the time interval to 1h, it crashes. Since it's a production environment, I've only tested this about three times in the past 3 days.

What is the configuration of that line visualisation?

Data

  • No filters
  • Metrics: Unique count of the high cardinality field
  • Buckets: Date histogram (auto interval) -> Split series by instance name (size:10)
  • Time interval: last 1 hour

Panel settings:

  • Legend position bottom
  • Grid show X-axis lines

I think everything else is at the default Kibana (7.2.1) settings.

@Christian_Dahlqvist

If that wasn't what you meant by configuration, please forgive me. As you see I'm trying to provide as much information as I can.

I'm sorry if I seem impatient, but @Christian_Dahlqvist responded within an hour on a Sunday and there has been nothing since. Is there anyone who's able to help me debug this issue? I do need a solution, and right now it only looks as if someone were handling the question while in reality nothing has happened.

Unfortunately, I do not have time to look into this at the moment. Note that this forum is staffed by volunteers and there are no SLAs or guarantees of a response or resolution.

Thank you, I do appreciate your (as in plural) time!

Any chance that someone can help out on this topic?

@HenningAndersen I'm sorry to tag you in but I saw this thread which is similar to my issue. Here are my current custom JVM settings:

-Xms30g
-Xmx30g
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
10-:-XX:-UseConcMarkSweepGC
10-:-XX:-UseCMSInitiatingOccupancyOnly
10-:-XX:+UseG1GC
10-:-XX:InitiatingHeapOccupancyPercent=45
10-:-XX:G1ReservePercent=15
-XX:MaxGCPauseMillis=700
-XX:NewRatio=4
-XX:SurvivorRatio=6

Do you have any idea how I can stop this behavior?

Hi @YvorL,

I see that you have set a few GC parameters that are non-standard. Can you explain why those have been set?

I think the first thing to try is increasing G1ReservePercent (to 25), as per the thread you linked. But that is something of an unqualified guess, based mostly on you linking that thread rather than on actual knowledge.

Also, you could try removing the last three of those GC settings (-XX:MaxGCPauseMillis, NewRatio, and SurvivorRatio) unless you know they are helping you.
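If it helps to see it concretely, here is roughly what the JDK 10+ section of jvm.options would look like with those suggestions applied (a sketch based on the settings posted above; the `10-:` prefix makes a line apply only on JDK 10 and later):

```
## JDK 10+: disable CMS, use G1 with a larger reserve
10-:-XX:-UseConcMarkSweepGC
10-:-XX:-UseCMSInitiatingOccupancyOnly
10-:-XX:+UseG1GC
10-:-XX:InitiatingHeapOccupancyPercent=45
10-:-XX:G1ReservePercent=25
## MaxGCPauseMillis / NewRatio / SurvivorRatio lines removed
```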

Information on G1GC settings is sparse; I read various articles (e.g., from Sematext) and performed some tests too. While I'm not an expert on garbage collection, I reckon the reserve will give more leeway for the request. Since my first post, I've scaled the cluster to give it more RAM and added two more nodes to distribute the shards across more servers. Unfortunately, that didn't help either. When the index in question receives the cardinality requests, at least one node hits the limit. The index itself is huge and I'm not sure what Elasticsearch stores in memory during the search, but the last time I ran the query, there was a spike of 48 GB in Java heap memory (across 6 nodes).
I'll try changing the GC options, but I doubt it will help with the issue. I'm out of ideas, though, so I can try that.

@polyfractal This is the original thread I opened. Since then there have been numerous changes to mitigate the issue, but none of them helped. I'm getting a better grasp on the problem (heap explosion) itself, but I still lack a lot of knowledge about ES and Java to know how to make it go away.

Heya @YvorL sorry again for the runaround between here and github :slight_smile:

I just left a reply because a colleague of mine spotted a detail: the terms agg is sorted by cardinality, which means we have to execute in depth-first rather than breadth-first mode. That unfortunately makes the agg very expensive and moves it into the "abusive" category, even though it seems very innocent from the outside perspective.

For posterity, here's what I mentioned in https://github.com/elastic/elasticsearch/issues/55240:

So as it turns out, jimczi noticed a detail that I missed: the terms agg is sorted by the cardinality agg:

"order": {
  "1": "desc"
},

Which actually puts the query into the "abusive" category, even though it looks pretty innocent. What happens is that sorting by a sub-agg means we have to execute the terms aggregation in depth_first mode.

Normally, terms aggs execute breadth_first, meaning each shard collects the list of terms and their counts, finds the top-n, and prunes away all the rest of the terms. It then executes the next layer of aggregations on those top-n buckets. We can do this pruning because the top-n is determined by count.

But when you sort by a sub-aggregation, there is no way to know which buckets to prune, because that depends on a not-yet-calculated quantity. This means we have to switch to depth_first, where we fully process the entire aggregation tree before we can do any pruning of results.

In the above query, this means we actually collect instance.keyword number of buckets (for each 30s interval), with a corresponding HLL sketch for the cardinality. This can get very expensive very quickly if instance.keyword has a moderate cardinality, and will quickly trip the circuit breaker as you're seeing.

So I think this really is a case of "expensive query" tripping the breaker, even though it looks very innocent to a user.
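To make the shape concrete, a Kibana visualization like the one described translates to roughly the following request (a sketch; the index name, field names, and agg names here are assumptions based on this thread, not the exact Kibana-generated body):

```
GET /my-index/_search
{
  "size": 0,
  "aggs": {
    "instances": {
      "terms": {
        "field": "instance.keyword",
        "size": 10,
        "order": { "unique_hcf": "desc" }
      },
      "aggs": {
        "unique_hcf": { "cardinality": { "field": "hcf.keyword" } },
        "over_time": {
          "date_histogram": { "field": "@timestamp", "fixed_interval": "30s" },
          "aggs": {
            "unique_hcf": { "cardinality": { "field": "hcf.keyword" } }
          }
        }
      }
    }
  }
}
```

Because the terms agg's `order` references the cardinality sub-agg, the breadth-first pruning described above is unavailable.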

I'll chat with the Kibana folks and see if there is some way to proactively warn users about sorting by sub-agg, particularly by cardinality which is relatively expensive (~kilobytes per bucket, rather than ~bytes). Or perhaps Kibana can default to a lower precision when it sees the user is sorting by cardinality. We might also want to trip the agg circuit breaker faster than 70-80% to help make this less impactful on nodes.

Apologies for missing this detail earlier!


Thank you @polyfractal for helping out!
I see that it's not a 60*5 bucket query but a 40000*30 one, so it wouldn't matter if I changed the returned sub-aggregation count, since those are calculated anyway.
The weird thing is that I didn't see this behavior before. On a separate cluster, everything was set up the same way except the Java version (8 vs. 11), the GC mode (CMS vs. G1GC), and the replica shards (there weren't any).
As you saw, I threw a lot of hardware at it, but it didn't help. The instance cardinality will likely grow (not at an alarming rate, but it will), and the high-cardinality field will as well. It seems that I can't lower the general heap memory usage; I'm already at 30 GB heap per node and have 32 cores handling only Elasticsearch on each server. I'm not sure how much memory that "abusive" query would need. Also, in my experience the allocation doesn't happen on all nodes, so it isn't distributed, which makes me think adding more ES instances won't help either. Aside from lowering the precision in Kibana or in the queries, what options do I have?

If you want to grab a heap dump, I'd be happy to take a look to confirm it's the situation we think it is (should be relatively obvious from a heap dump if it's the depth-first + cardinality issue).

There's an added component here, the coordinating node, which may explain why you are seeing uneven allocation. The coordinator receives results from all the shards and has to keep them in memory to perform the merge. So if you have 20 shards, the coordinator needs to keep 20*5*60 buckets in memory simultaneously to perform the merge, as well as associated metric data like the HLL sketch (in actuality a bit more, since we request shard_size from the nodes, not size, so it'll be a bit larger for the terms agg).
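For readers following along, the bucket counts in this thread lend themselves to a back-of-envelope heap estimate. The sketch sizes below are rough assumptions (~24 KB per HLL sketch at default precision, ~800 B at precision 100), and the bucket counts are the ones quoted in this thread, not measured values:

```python
# Rough heap estimate for a terms + cardinality aggregation.
# Every (term bucket, time interval) pair carries its own HLL sketch.

def agg_heap_bytes(n_term_buckets: int, n_time_buckets: int,
                   hll_bytes_per_bucket: int) -> int:
    """Approximate bytes held for all per-bucket HLL sketches."""
    return n_term_buckets * n_time_buckets * hll_bytes_per_bucket

# Depth-first: all ~40,000 instances collected, 30 intervals, default precision.
depth_first = agg_heap_bytes(40_000, 30, 24 * 1024)
print(f"depth-first, default precision: {depth_first / 1024**3:.1f} GiB")

# Same query with precision_threshold=100 (~800 B per sketch).
reduced = agg_heap_bytes(40_000, 30, 800)
print(f"depth-first, precision 100:     {reduced / 1024**3:.1f} GiB")

# Coordinator-side merge from the 20*5*60 figure above.
coordinator = agg_heap_bytes(20, 5 * 60, 24 * 1024)
print(f"coordinator merge estimate:     {coordinator / 1024**2:.0f} MiB")
```

The numbers are crude, but they show why the depth-first case dwarfs both the breadth-first top-10 case and the coordinator's merge buffer.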

I think we need to discover what is eating your steady-state memory usage, though. The nodes appear to idle at ~21 GB usage, which is 65% of the heap. So while the query does eat an outsized portion of memory, the node itself is idling pretty high (for reference, we normally configure the JVM to kick off GCs at 75%, so this is pretty close to what we consider "memory pressure").

Looking over node stats, hot threads, and threadpools might be helpful in trying to pin down memory consumers. Or a heap dump.

Two things come to mind to help right now:

  • Reducing precision, as mentioned. This can have a dramatic effect on the HLL sketch size but only increases error slightly. Even when you drop precision down to 100, the error is still <5% in most cases, and it drops memory usage from ~24 KB per bucket down to ~800 B.
  • Dedicated client (coordinating-only) nodes are potentially useful in this case. These are nodes sized similarly to data nodes, but they don't hold data. You then route expensive aggregation queries (or all queries) through them, and they can use their spare memory to handle large reductions. They aren't generally a great use of resources IMO, but they can sometimes be helpful.
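For reference, lowering precision is a one-line change on the cardinality agg (a sketch; the field and agg names are assumptions based on this thread). I believe Kibana's per-metric Advanced > JSON input box can carry this option:

```
"aggs": {
  "unique_hcf": {
    "cardinality": {
      "field": "hcf.keyword",
      "precision_threshold": 100
    }
  }
}
```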

@polyfractal Thanks for the descriptive answer! I was wondering about the coordinator node but wasn't entirely sure.
How would a heap dump work? That's ~21 GB of data (for one node), which isn't easy to transfer. Also, does it contain sensitive data of any kind?

Edit: I really appreciate your help debugging this!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.