Elasticsearch heap management and GC question

Background: We have 32 ES instances running on a bare-metal server, all with the same dataset replicated, and namespaced from each other (think containers). Recently we've been noticing a gradual increase in CPU load on this machine, until the load average reached 40-50; apparently this is due to all the instances running GCs frequently.

Now, after we removed some (supposedly) unused indexes (totaling around 40GB of data), the load average fell dramatically, from 40-50 to ~2.

My assumption is that our indexes grew to a point where all (or almost all) of the available heap (3GB) was consumed, which forced each ES instance to perform very frequent GCs (a.k.a. heap thrashing). Each GC freed up some heap momentarily, until it was consumed again.

However, I think that there's a hole in my thinking, and this is where I would like your help. I have no idea as to how ES uses the available heap with regards to its indexes:

  1. Does ES load every index into memory until all available heap is consumed? If so, what happens when the data that needs to be loaded into memory is bigger than the max heap limit?
  2. In general, could an increase in index data cause ES to fall into some kind of heap-thrashing state, i.e. where the consumed heap sits right at the max heap threshold and therefore very frequent GCs are performed?
  3. I guess my question can be reduced to: does ES absolutely need all of the indexed data in memory to operate, or can it fall back to using the disk when the data doesn't fit in memory? (See the sketch right below for how I'm comparing the two sizes.)
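
For concreteness, this is roughly how I'm comparing the on-disk index size against the heap limit of a single instance (a rough sketch using only the standard _stats/store and _nodes/stats/jvm REST endpoints; localhost:9200 is just a stand-in for where each of our instances actually listens):

    # Rough sketch: compare how much index data sits on disk with the JVM
    # heap limit of the node. localhost:9200 is an assumption about the setup.
    import json
    import urllib.request

    ES = "http://localhost:9200"

    def get(path):
        with urllib.request.urlopen(ES + path) as resp:
            return json.load(resp)

    # Total on-disk size of all index data in this (single-node) cluster.
    store_bytes = get("/_stats/store")["_all"]["total"]["store"]["size_in_bytes"]

    # Heap limit of each node as seen by the JVM.
    for node in get("/_nodes/stats/jvm")["nodes"].values():
        heap_max = node["jvm"]["mem"]["heap_max_in_bytes"]
        print(f"{node['name']}: data on disk = {store_bytes / 2**30:.1f} GiB, "
              f"heap max = {heap_max / 2**30:.1f} GiB, "
              f"disk:heap ratio ~ {store_bytes / heap_max:.0f}:1")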

Thanks in advance!

You should have seen evidence of this in your logs, particularly in the GC log. Did you?

No, the vast majority of your data lives on disk (or in filesystem cache) and not on the heap at all. It's not unusual to see ram:disk ratios in the region of 1:160. There's a certain amount of heap overhead per shard, and a common problem is not "too much data" so much as "too many shards". Here's a blog post that might help:

Thanks for the reference.

Yes, the GCs can be observed in the logs. For example, during a single day one instance performed ~10k GC runs, in contrast to normal days where only 50-500 GCs are performed:

[2020-04-28T06:25:18,491][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449522] overhead, spent [371ms] collecting in the last [1.3s]             
[2020-04-28T06:25:23,492][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449527] overhead, spent [393ms] collecting in the last [1s]               
[2020-04-28T06:25:27,561][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449531] overhead, spent [325ms] collecting in the last [1s]               
[2020-04-28T06:25:31,900][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449535] overhead, spent [553ms] collecting in the last [1.3s]             
[2020-04-28T06:25:36,355][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449539] overhead, spent [704ms] collecting in the last [1.4s]             
[2020-04-28T06:25:41,442][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449544] overhead, spent [401ms] collecting in the last [1s]               
[2020-04-28T06:25:45,620][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449548] overhead, spent [355ms] collecting in the last [1.1s]             
[2020-04-28T06:25:50,623][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449553] overhead, spent [460ms] collecting in the last [1s]               
[2020-04-28T06:25:54,626][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449557] overhead, spent [462ms] collecting in the last [1s]               
[2020-04-28T06:25:59,630][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449562] overhead, spent [358ms] collecting in the last [1s]               
[2020-04-28T06:26:03,634][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449566] overhead, spent [375ms] collecting in the last [1s]               
[2020-04-28T06:26:08,637][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449571] overhead, spent [407ms] collecting in the last [1s]               
[2020-04-28T06:26:12,640][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449575] overhead, spent [296ms] collecting in the last [1s]               
[2020-04-28T06:26:26,707][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449589] overhead, spent [373ms] collecting in the last [1s]               
[2020-04-28T06:26:30,710][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449593] overhead, spent [289ms] collecting in the last [1s]               
[2020-04-28T06:26:35,713][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][449598] overhead, spent [322ms] collecting in the last [1s] 
# ...
[2020-04-28T10:40:55,696][INFO ][o.e.m.j.JvmGcMonitorService] [node_foo] [gc][old][464462][103504] duration [5.3s], collections [1]/[5.5s], total [5.3s]/[9.2h], memory [2.8gb]->[2.3gb]/[2.9gb], all_pools {[young] [819.2mb]->[381.1mb]/[819.2mb]}{[survivor] [97.5mb]->[0b]/[102.3mb]}{[old] [1.9gb]->[1.9gb]/[2gb]}

The above goes on for the whole day, with roughly 4-5 GCs performed every minute.
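
To quantify this, I counted the JvmGcMonitorService overhead lines per day with something along these lines (a rough sketch; the log path is specific to our setup and shown here as a placeholder):

    # Rough sketch: count the "overhead" lines reported by JvmGcMonitorService
    # per day, to see how often GC pressure is being logged.
    from collections import Counter

    LOG_PATH = "/var/log/elasticsearch/node_foo.log"  # placeholder path

    per_day = Counter()
    with open(LOG_PATH) as log:
        for line in log:
            if "JvmGcMonitorService" in line and "overhead" in line:
                per_day[line[1:11]] += 1  # "[2020-04-28T..." -> "2020-04-28"

    for day, count in sorted(per_day.items()):
        print(day, count)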


After the removal of the unused indices (and after the load got back to normal), each ES cluster now has the following:

  • 1 node
  • 27 indices
  • 68 shards
  • 80,184,574 docs
  • 156.96GB
  • heap used: 74% (as reported by cerebro)
  • process CPU: 1% (as reported by cerebro)
  • load: 3.44 (as reported by cerebro)

The 4 removed indices were worth about 40GB and had up to 6 shards each.
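
For reference, the figures above come from cerebro, but they can also be read straight off the cluster stats API (a minimal sketch; localhost:9200 is an assumption about where each instance listens):

    # Minimal sketch: pull the same node/index/shard/doc/heap figures from
    # the _cluster/stats API instead of cerebro.
    import json
    import urllib.request

    with urllib.request.urlopen("http://localhost:9200/_cluster/stats") as resp:
        stats = json.load(resp)

    print("nodes:  ", stats["nodes"]["count"]["total"])
    print("indices:", stats["indices"]["count"])
    print("shards: ", stats["indices"]["shards"]["total"])
    print("docs:   ", stats["indices"]["docs"]["count"])
    print("store:  ", round(stats["indices"]["store"]["size_in_bytes"] / 2**30, 2), "GiB")

    heap_used = stats["nodes"]["jvm"]["mem"]["heap_used_in_bytes"]
    heap_max = stats["nodes"]["jvm"]["mem"]["heap_max_in_bytes"]
    print("heap:   ", f"{100 * heap_used / heap_max:.0f}% used")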

From the post you've linked to, it seems to me that this is a relatively small cluster that shouldn't justify such frequent GCs.

On the other hand, since the heap in use right now is at ~74%, maybe it's not too unrealistic that pre-deletion (with the 4 additional indexes) the heap was near 100%, or at least past some threshold that triggers frequent GCs?

Does this make any sense? Thoughts?

157GB split across 68 shards sounds inefficient: that's an average of just 2.3GB per shard, which is pretty small. Can you shrink any of these indices down to a single shard and/or consolidate them into fewer indices?
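
If you do go the shrink route, on a single-node cluster the sequence is roughly this (a sketch against the standard _shrink API; the index names are placeholders, and since all shards already live on your one node there's no relocation step to worry about):

    # Sketch of the shrink sequence for a single-node cluster: block writes on
    # the source index, then shrink it into a new single-shard index.
    # Index names are placeholders; localhost:9200 is an assumption.
    import json
    import urllib.request

    ES = "http://localhost:9200"

    def request(method, path, body):
        req = urllib.request.Request(
            ES + path,
            data=json.dumps(body).encode(),
            headers={"Content-Type": "application/json"},
            method=method,
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    # 1. Make the source index read-only (required before shrinking).
    request("PUT", "/my_source_index/_settings", {"index.blocks.write": True})

    # 2. Shrink it into a new index with a single primary shard.
    request("POST", "/my_source_index/_shrink/my_source_index_shrunk",
            {"settings": {"index.number_of_shards": 1,
                          "index.number_of_replicas": 0}})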

Also, the post I linked suggests limiting yourself to 20 shards per GB of heap, so with only 3GB of heap you should try to stay below 60 shards. You now have 68 shards after removing 24, so when you were having trouble you had more than 90 shards.
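
A quick way to sanity-check that guideline against a live node (a sketch; it just counts the lines of _cat/shards and reads the heap limit from _nodes/stats/jvm, with localhost:9200 assumed):

    # Sketch: compare the node's shard count against the ~20-shards-per-GB-of-heap
    # guideline. localhost:9200 is an assumption.
    import json
    import urllib.request

    ES = "http://localhost:9200"

    # One line per shard in the default _cat/shards output.
    with urllib.request.urlopen(ES + "/_cat/shards") as resp:
        shard_count = len(resp.read().decode().strip().splitlines())

    # Heap limit of the (single) node.
    with urllib.request.urlopen(ES + "/_nodes/stats/jvm") as resp:
        node = next(iter(json.load(resp)["nodes"].values()))
    heap_gb = node["jvm"]["mem"]["heap_max_in_bytes"] / 2**30

    print(f"{shard_count} shards with {heap_gb:.1f}GB heap; "
          f"guideline suggests staying below {20 * heap_gb:.0f}")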

Also note that 7.7 brings some big improvements in heap usage, so maybe the simplest solution is to upgrade before digging any deeper.


Other possible sources of JVM heap usage are global ordinals (they are shown within the field data stats) and a high number of segments per shard.

As @DavidTurner suggests, the shard size seems low.
You can also refer to https://www.elastic.co/guide/en/elasticsearch/reference/current/avoid-oversharding.html
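
Something like this shows both (a sketch; _stats/fielddata and _cat/shards are standard endpoints, and localhost:9200 is an assumption):

    # Sketch: check fielddata memory (where global ordinals show up) and the
    # segment count per shard. localhost:9200 is an assumption.
    import json
    import urllib.request

    ES = "http://localhost:9200"

    # Fielddata memory per index.
    with urllib.request.urlopen(ES + "/_stats/fielddata") as resp:
        stats = json.load(resp)
    for name, idx in stats["indices"].items():
        used = idx["total"]["fielddata"]["memory_size_in_bytes"]
        if used:
            print(f"{name}: fielddata = {used / 2**20:.1f} MiB")

    # Segment count per shard.
    with urllib.request.urlopen(ES + "/_cat/shards?h=index,shard,segments.count") as resp:
        print(resp.read().decode())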


Thanks everyone for the insights. Just to clarify, frequent GCs are almost always the result of the heap being full (or close to the max heap size), right?
