Search response time doubled erratically

Hi,

Our system recently faced a CPU usage spike, and the underlying reason is still unknown. We have faced high memory usage and disk alerts in the past, since we run a nightly bulk indexing job that updates almost all of our docs, but high CPU usage has never been a problem before.

The data collected so far:

Node 03 (out of 6 data nodes and 3 master nodes) suffered from high CPU usage (> 95%) for 5 minutes, pushing response time to a spike of 1 sec, while the average response time is 40 ms.
Looking through the metrics, there was a slight bump in the indexing count on the high-CPU node, and at the same time a slight bump in Young GC (neither was anything like a spike, though).

I am not ruling out heavy indexing, since we do have a Kafka consumer accepting bulk indexing data at any time of day, but it is throttled to a maximum of 250 docs per second, with a 250 ms pause between each bulk call.
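Roughly, the throttling looks like the sketch below (a simplified version; the endpoint URL, index name, and the poll_kafka helper are placeholders, not our actual code):

```python
import json
import time
import requests

BULK_URL = "http://localhost:9200/_bulk"  # placeholder cluster endpoint
BATCH_SIZE = 250                          # at most 250 docs per bulk call
PAUSE_SECONDS = 0.25                      # 250 ms lag between bulk calls

def index_batch(docs, index_name="docs"):
    """Send one _bulk request containing up to BATCH_SIZE documents."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["id"]}}))
        lines.append(json.dumps(doc))
    body = "\n".join(lines) + "\n"        # the bulk API expects NDJSON with a trailing newline
    resp = requests.post(BULK_URL, data=body,
                         headers={"Content-Type": "application/x-ndjson"})
    resp.raise_for_status()

def consume(poll_kafka):
    """poll_kafka is a stand-in for our consumer loop; it yields lists of docs."""
    for docs in poll_kafka():
        for i in range(0, len(docs), BATCH_SIZE):
            index_batch(docs[i:i + BATCH_SIZE])
            time.sleep(PAUSE_SECONDS)     # throttle between bulk calls
```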

Also, the hot threads endpoint did give some data, although I am not able to decipher it yet.

Link to Hot threads
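In case it helps, this is how the hot threads output can be pulled from the cluster (the host and parameters here are just an example, not our exact call):

```python
import requests

# Example only: assumes a node is reachable on the default HTTP port.
resp = requests.get(
    "http://localhost:9200/_nodes/hot_threads",
    params={"threads": 3, "interval": "500ms"},  # top 3 hot threads, sampled over 500 ms
)
print(resp.text)  # plain-text stack samples, one section per node
```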

Hey,

so from the hot_threads output it rather looks as if search was eating some CPU (but by far not all of it, so it might actually be fine), as the threads mentioned look like [shopo-elasticsearch-prd-sg2-02][search][T#3]

You should also check your GC statistics (part of the node stats), and you can also check your log files for long-running GCs.
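For example, the per-collector counts and times are under jvm.gc.collectors in the node stats; something like this quick sketch will dump them (localhost is just a placeholder for one of your nodes):

```python
import requests

# Assumes the cluster is reachable on localhost:9200.
stats = requests.get("http://localhost:9200/_nodes/stats/jvm").json()

for node in stats["nodes"].values():
    collectors = node["jvm"]["gc"]["collectors"]
    for name in ("young", "old"):
        print(node["name"], name,
              collectors[name]["collection_count"], "collections,",
              collectors[name]["collection_time_in_millis"], "ms total")
```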

--Alex

There has been some development. After the spike, CPU usage decreased gradually and is normal.
However, our response time is consistently staying between 70-250 ms (usual average: 35-100 ms).
There is currently a near-sawtooth (not an exactly uniform sawtooth) pattern in the response time.

As per your suggestion, there was a small bump in the old GC count when the spike occurred.

I haven't found any anomaly in the node stats so far. Will update if I find one. Still posting it here for investigation.

node stats

Also posting the most recent hot threads output.
hot_thread_2