CPUs at 100%, but no disk I/O - ES 7.3

  • ES Version 7.3
  • Various node sizes (hot/warm) (25+ nodes)
  • SSDs on all nodes
  • JDK 11
  • X-Pack Document level security (thank you for fixing the bitset issue in 7.3)

Periodically, our warm nodes, which normally sit at 0% CPU utilization, spike to 100% for several minutes and then drop back down to 0%. During these spikes, iostat shows 0 disk I/O. In fact, disk I/O on our warm nodes is very low in general. top confirms that the CPU spike comes from the ES process. The lack of disk I/O is concerning because it makes me feel like I can't shard my way out of the spike.

Here is a link to a snippet of our hot_threads:

Any suggestions?

Any activity in the gc logs?

gc.logs are normal (we run them through an analyzer as well just to be sure). Boxes have a lot of free heap. Thank you so much for responding!!

We did notice this issue increase after enabling document level security with our own realm...

Thanks,
Aharon

If that CPU spike happens again, use the hot threads API to figure out in which part of the Elasticsearch code the time is spent.
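For reference, a minimal sketch of calling the hot threads API from a script during a spike. The host, the node name `warm-node-1`, and the helper names are placeholders, not anything from this thread. Authentication is left out, so a cluster with X-Pack security enabled would need credentials added to the request.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def hot_threads_url(base_url, node_filter=None, threads=3, interval="500ms", kind="cpu"):
    """Build a URL for GET /_nodes/hot_threads (optionally limited to some nodes)."""
    if node_filter is None:
        path = "/_nodes/hot_threads"
    else:
        # node_filter can be a node name, a node ID, or a comma-separated list of them
        path = f"/_nodes/{quote(node_filter, safe=':,*')}/hot_threads"
    query = urlencode({"threads": threads, "interval": interval, "type": kind})
    return f"{base_url.rstrip('/')}{path}?{query}"


def fetch_hot_threads(url, timeout=30):
    """Return the plain-text hot threads report; needs a reachable cluster."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Example (hypothetical host and node name):
# print(fetch_hot_threads(hot_threads_url("http://localhost:9200", "warm-node-1")))
```

Capturing the output a few times while the CPU is at 100% should show whether the busy threads are search threads or something else.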

We noticed the same behaviour on our cluster as well. It was caused by queries generated by Kibana's KQL value suggestions feature while a user is typing a query. We have several aliases targeting >1K indices. The value suggestion mechanism sends queries against all indices to discover the top terms, without considering the selected time frame. This results in a huge load on the cluster...
Our temporary solution was to disable KQL value suggestions in Kibana until we investigate the issue further.
Maybe you are facing the same issue...
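To illustrate the difference, here is a rough sketch of a suggestion-style search body with and without a time filter. The exact request Kibana sends is not shown in this thread, so the body below (a terms aggregation with a prefix `include`, a timeout, and `terminate_after`) is only an assumed approximation, and the function and parameter names are hypothetical.

```python
def suggestion_request(field, prefix, time_field=None, gte=None):
    """Approximate a value-suggestion search: a terms aggregation on one field.

    Without time_field/gte, the body has no query, so every index behind the
    alias or pattern is searched, including those on warm nodes.
    """
    body = {
        "size": 0,
        "timeout": "1s",
        "terminate_after": 100000,
        "aggs": {
            "suggestions": {
                # prefix is used as a regex here and is not escaped in this sketch
                "terms": {"field": field, "include": f"{prefix}.*", "shard_size": 10}
            }
        },
    }
    if time_field and gte:
        # A range filter gives Elasticsearch a chance to skip shards that
        # hold no documents in the requested time window.
        body["query"] = {"bool": {"filter": [{"range": {time_field: {"gte": gte}}}]}}
    return body


unfiltered = suggestion_request("host.name", "web")
time_aware = suggestion_request("host.name", "web", time_field="@timestamp", gte="now-15m")
```

Sending the first form against an alias that covers all indices would be consistent with CPU load on nodes that only hold old data.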

spinscale, I may have already posted a hot-threads output in my first post (unless I used the API wrong)

Bertrand, thank you so much for your suggestion! I logged into our production environment and ran iostat on a warm node while I typed a field_name: in the Discover search bar... The search time frame was the default of 15 minutes. All of a sudden, all my warm nodes went to 25% CPU with no disk I/O. My warm nodes have data from >7 days ago... So you are correct, the Kibana value suggestions feature is not time aware. We will be disabling it.

Elastic team - I think the Kibana value suggestions play a role, but not the entire role, as I was never able to get them to consume more than 25% of the processors on our warm nodes. Is there anything else similar that I can disable that may not be time aware and may inadvertently query our warm nodes?

I must have missed the hot threads output, sorry for that.

Regarding speeding up the suggestions, you may find https://github.com/elastic/kibana/pull/37643 interesting.
