Our system recently faced a CPU usage spike and the underlying reason is still unknown. We have faced high memory usage and disk alerts in the past, since we run a nightly job of bulk indexing, updating almost all our docs. But high CPU usage has not been a problem.
The data collected so far:
Node 03 (one of 6 data nodes, alongside 3 master nodes) suffered from high CPU usage (> 95%) for 5 minutes, which pushed the response time up to 1 sec, while the average response time is 40 ms.
Looking through the metrics, there was a slight bump in the indexing count on the high-CPU node. At the same time, there was a slight bump in Young GC. Neither was anything like a spike, though.
I am not ruling out heavy indexing, since we do have a Kafka consumer accepting bulk indexing data at any time of day, but that is throttled to a maximum of 250 docs per second, with a pause of 250 ms between bulk calls.
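As a rough sanity check on that throttle, the arithmetic can be sketched as below. This assumes the 250 ms pause is the only delay between bulk calls unless a call duration is passed in; the function name and the `call_ms` parameter are hypothetical and not part of our consumer.

```python
def max_docs_per_bulk(max_docs_per_sec: float, pause_ms: float, call_ms: float = 0.0) -> float:
    """Largest bulk size that keeps the consumer within the rate cap.

    One cycle is the pause plus the (assumed) duration of the bulk call itself.
    """
    cycle_seconds = (pause_ms + call_ms) / 1000.0
    return max_docs_per_sec * cycle_seconds


if __name__ == "__main__":
    # With the numbers from the post and an instantaneous bulk call:
    print(max_docs_per_bulk(250, 250))  # 62.5
```

So with those settings each bulk request should carry at most about 62 docs, which is small. If the consumer logs show much larger batches around the time of the spike, the throttle is not doing what I think it is.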
Also, the hot threads endpoint did give some data, although I am not able to decipher it yet.
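For anyone who wants to reproduce the capture, here is a minimal sketch of how the hot threads output could be pulled for a single node. It assumes an Elasticsearch cluster reachable over plain HTTP; the base URL and the node name `node-03` are placeholders, and the helper names are hypothetical.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def hot_threads_url(base_url, node=None, threads=3, interval="500ms", thread_type="cpu"):
    """Build a nodes hot threads URL, optionally restricted to one node."""
    node_part = f"/{quote(node, safe='')}" if node else ""
    query = urlencode({"threads": threads, "interval": interval, "type": thread_type})
    return f"{base_url.rstrip('/')}/_nodes{node_part}/hot_threads?{query}"


def fetch_hot_threads(url, timeout=10):
    """Fetch the plain-text hot threads report (needs a reachable cluster)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Placeholder host and node name; adjust to the real cluster.
    print(hot_threads_url("http://localhost:9200", node="node-03"))
```

Sampling this a few times while the CPU is high, and comparing which thread pools show up repeatedly (for example write, search, or merge threads), should make the output easier to read than a single snapshot.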