100% CPU and GC on all nodes

Hi guys,
we recently did a rolling upgrade of our ES cluster (40+ nodes, 120+ indices) from 5.6.14 to 6.8.1.
We ran into many issues, which we were eventually able to fix.

One of the issues we still have, however, is that the whole cluster is at 100% CPU, with the logs looking like this:

[2019-07-20T15:43:34,571][INFO ][o.e.m.j.JvmGcMonitorService] [server1] [gc][289] overhead, spent [309ms] collecting in the last [1s]
[2019-07-20T15:43:38,308][INFO ][o.e.m.j.JvmGcMonitorService] [server1] [gc][292] overhead, spent [580ms] collecting in the last [1.4s]
[2019-07-20T15:43:39,561][INFO ][o.e.m.j.JvmGcMonitorService] [server1] [gc][293] overhead, spent [358ms] collecting in the last [1.2s]

...

Hot threads Output:
Pastebin
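
(For reference, the dump was pulled with the hot threads API, roughly along these lines; host and parameters are nothing special:)

# Hot threads across all nodes
curl -s 'http://localhost:9200/_nodes/hot_threads?threads=3&interval=500ms'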

Also: we're running up to 4 nodes per physical server. Each node has a 30GB JVM heap.

Any idea how to debug this properly?

Thanks! :wink:

I'm not sure what the cause of your GC issue is, but it seems clear that Elasticsearch is struggling to free the memory it needs to operate. This could be because query caches are filling up quickly, or because too many tasks are queued up (you could look for rejected tasks, a sign of an overloaded cluster).
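
As a quick first check, something along these lines (the host is just a placeholder for any node) shows active, queued and rejected tasks per thread pool:

# Per-node thread pool stats; high "rejected" counts are a sign of overload
curl -s 'http://localhost:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected'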

As for running 4 nodes on one physical server: keep in mind that Lucene relies on the file system cache to speed up queries. Ideally, Lucene should be able to read the most frequently queried data from the file system cache and only access the disk for less frequently used data.

By assigning too much RAM to the JVMs you may starve the file system cache; if each of your 4 nodes runs a 30GB heap, you've effectively locked up 120GB of the physical memory on that server, which may not leave much for the file system cache and will cause more of your queries to hit the disk to fetch data.

You could try experimenting with the heap sizes, reducing them to, say, 16GB or 8GB (which is what I'm using in my clusters) to see whether that reduces GC frequency and duration. Since GC only kicks in above a certain heap occupancy, a smaller heap will be collected more often but also more quickly than a big one, which may help in your case.
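
One quick way to try a smaller heap without editing every jvm.options file is to override it at startup; this sketch assumes a tarball install and uses the 16GB example size:

# Start a node with a 16GB heap instead of 30GB; -Xms and -Xmx should match.
# The same values can be made permanent in config/jvm.options.
ES_JAVA_OPTS="-Xms16g -Xmx16g" ./bin/elasticsearch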

Good luck!

What is the full output of the cluster stats API? What is the hardware specification of the nodes this cluster is running on?
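
Something like this will get it (localhost:9200 stands in for any node in the cluster):

# Full cluster stats, pretty-printed with human-readable units
curl -s 'http://localhost:9200/_cluster/stats?human&pretty'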

Over the last few hours the situation seems to have cleared up slightly.
Only one node is still struggling, and indexing is only progressing bit by bit.
As a test, I closed the indices with active shard movement. That might have helped a bit?

My nodes seem to have rejected anywhere from 1,000 to over 800,000 writes and searches.
This is the output for the node that still has a high CPU load:

server2171-2 search              97 974 737559
server2171-2 write                7   0   8394
server2171-4 search              97 919 709452
server2171-4 write                9   2   5543
server2171-3 search              97 954 616195
server2171-3 write               23   0   7021
server2171   search              97 997 876564
server2171   write                7   0   9643

As for the hardware specifications:
CPU: 64+ threads
RAM: (2 * [Number of ES instances] * 30GB) + 15GB+ extra headroom (In this case 280GB)
Disk: ZFS volume on SSD or SAS RAID

I will try this tomorrow.

Here is the cluster stats output:
Pastebin

And here is a screenshot of htop:

I have a couple of comments:

  • It looks like you only have 2 master-eligible nodes. This is bad, given that Elasticsearch is based on consensus algorithms that require a majority of master-eligible nodes to be present in order to elect a master. You should therefore always have at least 3 master-eligible nodes and make sure minimum_master_nodes is set correctly (see the sketch after this list). As it is now, your cluster is likely either not highly available or misconfigured, which can lead to data loss.
  • It seems like a significant portion of your shards are not replicated and your cluster is currently in a red state.
  • The nodes are running a variety of OS and JVM versions. I'm not sure what effect, if any, this has.
  • It seems all nodes are configured as ingest nodes. Are you using ingest pipelines extensively or is this just the default setting?
  • You are using SearchGuard to secure your cluster. I have never used it, so I don't know how it affects heap usage and GC patterns.
  • Given that you have different types of storage across the hosts, are you running a hot-warm architecture? If you are, how is work distributed across them? Are the problems spread across all nodes?
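
To make the first point concrete, here is a minimal sketch (the host is a placeholder; with 3 master-eligible nodes the correct quorum is 2):

# In elasticsearch.yml on at least 3 nodes: node.master: true
# Then require a majority for master election (also settable statically in elasticsearch.yml):
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"discovery.zen.minimum_master_nodes": 2}}'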

This cluster was created in ES 1 and upgraded all the way to ES 6. That's why some things might not be optimal.

We don't use replicas for our data (yet).
The red state is caused by some nodes being offline (we can't run 6 nodes at once since upgrading to ES 6, but that is a separate issue).
The recent indices are green, though.

This is the default setting and a relic of ES 2.

We don't use the built-in hot-warm architecture (yet).
We have two tiers (high-performance/low-performance) and allocate indices by age.
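
Roughly like this, in case it's relevant (the attribute name "tier" and the index name below are illustrative, not our exact config):

# In elasticsearch.yml per node: node.attr.tier: high-performance   (or low-performance)
# Pin an older index to the low-performance tier via shard allocation filtering:
curl -s -X PUT 'http://localhost:9200/logs-2019.07.19/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index.routing.allocation.require.tier": "low-performance"}'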

EDIT:
Here are some graphs from ES Monitoring:

EDIT2:
If I close every index except the current one, indexing works like a charm.
As soon as I open yesterday's index, everything goes back to 100% again.
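
(The close/open test is just the standard index APIs, along these lines; the index name is an example:)

# Close an index so its shards stop consuming heap and CPU, reopen it to test again
curl -s -X POST 'http://localhost:9200/logs-2019.07.19/_close'
curl -s -X POST 'http://localhost:9200/logs-2019.07.19/_open'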

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.