100% CPU after upgrade (7.1.1 to 7.3)

After upgrading to 7.3 over the weekend, I now have a node that constantly sits at full CPU utilization. _nodes/hot_threads is empty. The cluster has 25 indices and 250 total shards, and is made up of 3 machines, each with 2 cores and 8 GB of memory.

Replacing the node with high CPU usage with a new machine did not fix the situation; the high CPU usage came back after rebalancing. Are there any known steps to fix this, or is this something new that was introduced in 7.3?


This is surprising, particularly since hot threads is empty. Could you share the full output of the following? Please use something like https://gist.github.com, since it will be quite large.

GET _nodes/hot_threads?threads=99999&ignore_idle_threads=false
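If it is easier to capture that output from a script than from a console, something along these lines should work. This is only a minimal sketch: the function names are made up for illustration, and it assumes the node is reachable over plain HTTP without authentication, which may not match your setup.

```python
import urllib.parse
import urllib.request


def hot_threads_url(base_url, threads=99999, ignore_idle_threads=False):
    """Build the hot threads URL with the same parameters as the request above."""
    query = urllib.parse.urlencode({
        "threads": threads,
        # The API expects lowercase true/false rather than Python's True/False.
        "ignore_idle_threads": str(ignore_idle_threads).lower(),
    })
    return f"{base_url.rstrip('/')}/_nodes/hot_threads?{query}"


def save_hot_threads(base_url, out_path):
    """Fetch the plain-text hot threads report and write it to out_path."""
    with urllib.request.urlopen(hot_threads_url(base_url), timeout=60) as resp:
        body = resp.read()
    with open(out_path, "wb") as f:
        f.write(body)
```

For example, `save_hot_threads("http://localhost:9200", "hot_threads.txt")` would write the report to a local file that can then be pasted into a gist. Both the address and the file name there are assumptions.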

Another possibility is that it's busy doing GC, which won't show up in the hot threads. Can you share the last thousand lines or so of the GC log too?
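To grab just the tail of the GC log, a small helper like the one below is enough. It is a sketch, not anything Elasticsearch-specific: the helper name is hypothetical, and the log location is an assumption (commonly logs/gc.log under the Elasticsearch home directory, or /var/log/elasticsearch/gc.log for package installs, depending on your jvm.options).

```python
from collections import deque


def tail_lines(path, n=1000):
    """Return the last n lines of a text file, keeping at most n lines in memory."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        # A deque with maxlen discards older lines as newer ones are read.
        return list(deque(f, maxlen=n))
```

Calling `"".join(tail_lines("/var/log/elasticsearch/gc.log"))` would give the text to paste, assuming that path is where your log lives.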

Here is the hot threads output you asked for: https://gist.github.com/icheishvili/3e7cd9382ae34c616df9e601f4771751

And here are the last 1000 lines of gc.log: https://gist.github.com/icheishvili/a8075376002ced072bfec2e8e3febebe

From what I can tell, GC behavior on all 3 nodes is quite similar. Seeing Young Allocation Failures in the log is what prompted me to compare the nodes. I'm happy to post more GC logs to show this.

The misbehaving node has gotten worse and worse (up to a load average of 20), and this has made our entire deployment unstable, so we are being forced to revert to 7.1.1. I would advise anyone reading this to carefully test 7.3.0 against their own environment and traffic pattern, or avoid it entirely.