Elasticsearch process causes CPU soft lockup (causing the server to hang)

We are experiencing hangs on multiple machines that run only an Elasticsearch data node.
When a hang occurs, the kernel reports a soft lockup: a task or kernel thread holds a CPU without releasing it for longer than the watchdog allows. The logs attribute it to the java process (Elasticsearch is the only Java process running on that machine):

May 14 05:24:21 localhost kernel: [6006808.160001] watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [java:5783]
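
For anyone who wants to scan their own kernel logs for the same symptom, here is a minimal sketch that pulls the CPU number, stuck duration, task name, and PID out of a line like the one above. The function name `parse_soft_lockup` and the regular expression are my own illustration, written against this one log line only; other kernel versions may format the message differently. Note that the name in brackets is the task that was running on the CPU when the watchdog fired, which by itself does not show that the task is at fault.

```python
import re

# Pattern written against the single watchdog line quoted in this post.
# Assumption: other kernels may word or format the message differently.
SOFT_LOCKUP = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockup(line):
    """Return the fields of a soft lockup message, or None if the line does not match."""
    m = SOFT_LOCKUP.search(line)
    if m is None:
        return None
    return {
        "cpu": int(m["cpu"]),       # CPU the watchdog reported as stuck
        "seconds": int(m["secs"]),  # how long the CPU was stuck
        "comm": m["comm"],          # task name shown in brackets
        "pid": int(m["pid"]),       # PID shown in brackets
    }


line = (
    "May 14 05:24:21 localhost kernel: [6006808.160001] watchdog: "
    "BUG: soft lockup - CPU#5 stuck for 23s! [java:5783]"
)
print(parse_soft_lockup(line))
```

Running this over each line of the kernel log (for example the output of `journalctl -k`) and counting the non-None results per machine gives a quick view of how often the lockup occurs.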

This has already occurred on 5 different machines.

A full description is in https://github.com/elastic/elasticsearch/issues/30667


Hi @jasontedor, since you closed the issue on GitHub, I was wondering if you could elaborate here on why you are sure this isn't an issue in Elasticsearch?
The reason I'm asking is that we are in an ongoing investigation with Azure, which has already opened a ticket with Canonical about this issue. When Azure discussed the issue with Canonical, they insisted that the problem is with the java process (in this case Elasticsearch) not releasing the CPU.
If you could provide us with any information, or guide us toward proof that this is not an Elasticsearch issue, it would help expedite finding the root cause.

Thanks for your help!

Please read through the thread that I linked to from GitHub. It looks identical to the problem that you're experiencing, is on the same kernel, and references several LKML threads on similar issues. This screams kernel issue, not Java issue to me, and is almost surely not an Elasticsearch issue. I am open and willing to be proven wrong, but it will require evidence.

@jasontedor thanks for the quick response. It appears you are right: it is a kernel issue on Azure, which should be fixed in the next kernel update.
FYI, in case you have other customers who are seeing the same behavior.