We are experiencing hangs on multiple machines that run only an Elasticsearch data node.
When it occurs, you see the symptoms of a soft lockup: a task or kernel thread holds a CPU without releasing it for longer than the allowed period. The logs show that it comes from the java process (Elasticsearch is the only Java process running on that machine):
May 14 05:24:21 localhost kernel: [6006808.160001] watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [java:5783]
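To collect evidence across machines, log lines like the one above can be pulled out of the kernel log and summarized. Below is a minimal sketch, assuming the message format is exactly as shown in the line above; the names `SOFT_LOCKUP` and `parse_soft_lockups` are made up for this illustration and are not part of any tool.

```python
import re

# Pattern modeled on the single log line quoted above (an assumption:
# other kernel versions may word the message differently).
SOFT_LOCKUP = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)

def parse_soft_lockups(lines):
    """Return one dict per soft lockup message found in `lines`."""
    events = []
    for line in lines:
        m = SOFT_LOCKUP.search(line)
        if m:
            events.append({
                "cpu": int(m["cpu"]),
                "seconds": int(m["secs"]),
                "comm": m["comm"],   # task name reported by the watchdog
                "pid": int(m["pid"]),
            })
    return events

sample = [
    "May 14 05:24:21 localhost kernel: [6006808.160001] watchdog: BUG: "
    "soft lockup - CPU#5 stuck for 23s! [java:5783]",
]
events = parse_soft_lockups(sample)
print(events)
```

Feeding it the lines of a saved kernel log (for example, a file read with `open(path)`) gives a list that can be grouped by CPU or PID to see whether the lockups always involve the same task.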
Hi @jasontedor, since you closed the issue on GitHub, could you elaborate here on why you are sure this isn't an Elasticsearch issue?
The reason I'm asking is that we are in an ongoing investigation with Azure, which has already opened a ticket with Canonical about this issue. When Azure discussed the issue with Canonical, they insisted that the problem is the java process (in this case Elasticsearch) not releasing the CPU.
If you could provide any information, or guide us toward proof that this is not an Elasticsearch issue, it would help expedite finding the root cause.
Please read through the thread that I linked to from GitHub. It looks identical to the problem that you're experiencing, is on the same kernel, and references several LKML threads on similar issues. This screams kernel issue, not Java issue to me, and is almost surely not an Elasticsearch issue. I am open and willing to be proven wrong, but it will require evidence.