100% CPU system time on HDD data node

The ES version is 7.11.1.
We use 16-core, 64 GB VMs, each with 4 physical HDDs (in a striped LVM volume), as warm data nodes.

The issue is that the CPU system time of a random VM often suddenly rises to 100%, and the VM then hangs until it leaves the cluster.

I used top and pidstat to confirm that the process is Elasticsearch, and "perf top" shows output like this:

71.51%  [kernel]  [k] __pv_queued_spin_lock_slowpath
 1.75%  [kernel]  [k] _raw_spin_lock_irqsave
 1.42%  [kernel]  [k] compact_checklock_irqsave.isra.24

or like this:

7.89%  [kernel]  [k] isolate_freepages_block
3.96%  [kernel]  [k] __pv_queued_spin_lock_slowpath
3.63%  [kernel]  [k] copy_user_enhanced_fast_string
1.75%  [kernel]  [k] __list_del_entry

Is this a bug, or something else?
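
For what it's worth, compact_checklock_irqsave and isolate_freepages_block are part of the kernel's memory compaction code, so it may be worth checking whether compaction is stalling on the affected VM. The sketch below is only a starting point, assuming a Linux host that exposes /proc/vmstat and the transparent hugepage settings under /sys/kernel/mm/transparent_hugepage; the function names are made up for illustration, and nothing in this thread confirms that compaction is the actual cause.

```python
from pathlib import Path


def parse_compaction_counters(vmstat_text):
    """Return the compact_* counters found in /proc/vmstat content as a dict."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        # Each /proc/vmstat line is "<name> <integer value>".
        if len(parts) == 2 and parts[0].startswith("compact_"):
            counters[parts[0]] = int(parts[1])
    return counters


def read_if_exists(path):
    """Return the stripped file content, or None if the file is missing."""
    p = Path(path)
    return p.read_text().strip() if p.exists() else None


def report():
    """Print compaction counters and transparent hugepage settings (hypothetical helper)."""
    vmstat = read_if_exists("/proc/vmstat")
    if vmstat is not None:
        for name, value in sorted(parse_compaction_counters(vmstat).items()):
            print(f"{name} {value}")
    # Assumption: the standard THP sysfs location; it may be absent on some kernels.
    for knob in ("enabled", "defrag"):
        setting = read_if_exists(f"/sys/kernel/mm/transparent_hugepage/{knob}")
        print(f"transparent_hugepage/{knob}: {setting}")
```

Calling report() twice a few seconds apart while the system time is high, and comparing the compact_* values, would show whether the counters are actually moving during the spike.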

What do your hot threads, slow logs, or Elasticsearch logs show when this happens?
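
In case it helps, here is a rough sketch of how the hot threads output could be captured the next time a node spikes. It calls the nodes hot threads API (GET /_nodes/hot_threads) with its threads and interval parameters; the base URL, the lack of authentication or TLS, and the helper names are assumptions for illustration and will need adjusting for a secured cluster.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def hot_threads_url(base_url, threads=3, interval="500ms"):
    """Build the URL for the nodes hot threads API."""
    query = urlencode({"threads": threads, "interval": interval})
    return f"{base_url.rstrip('/')}/_nodes/hot_threads?{query}"


def fetch_hot_threads(base_url, timeout=30):
    """Return the plain-text hot threads report for all nodes."""
    # Assumption: the node is reachable over plain HTTP without credentials.
    with urlopen(hot_threads_url(base_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For example, print(fetch_hot_threads("http://localhost:9200")) run against any node that is still responsive would dump the report, which could then be saved alongside the perf output.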

Previously, I changed the hostname of each node, reinstalled ES (going from version 5.6.3 to version 7.11.1), and then added the nodes to another cluster.
After rebooting the system 2 days ago, everything is OK now. I forgot to capture the hot threads info.
