I'm seeing an issue where two systems stop functioning after a while whenever the agent is active.
I've watched this happen for 2-3 months now and have run some tests, such as disabling prevention on one of them.
I'm on 8.7.0; I think it started around 8.6, after a few months of running 8.5 without issues.
The behaviour is something like this:

- the console shows hung task timeouts related to Defend or other kernel tasks
- can still log in, system usable
- system stops processing most IO-related tasks
- can still log in, but can't run processes
- after a certain point, login is no longer possible
- KVM VMs are still functioning at this point
- Docker containers are still functioning at this point
As far as I understand it, the IO is blocked on the kernel side; there's no disk issue or anything like that. Since everything hangs on IO, at some point you reach overload conditions.
I've verified that this happens with the Elastic Agent installed and enrolled, and does not happen when it's uninstalled.
System A is an Ubuntu KVM host; that one has kernel live patching.
System B is an OEL8 Zimbra mail server with nothing special about it (I think).
Both system A and system B have a Synology backup agent (installed long before this started).
I cannot see any corresponding alerts in Kibana
Still trying to get an IPMI snapshot of system A
On system B I can usually see errors like the ones in the attached screenshot.
I'm trying to find a way to avoid these issues; one of the worst parts is that it happens even in detect mode.
FYI, I have only about 15 systems to test with, so the scale is too small to expose patterns.
Hi! What kernel version are the affected machines using?
Based on your description, this sounds a lot like a rare issue we've observed that is related to fanotify, the kernel feature we use for file events. It can cause an unrecoverable system lockup under some rare conditions, for example if we happen to crash while holding a fanotify descriptor, or are SIGKILLed (kill -9) without being able to clean up system resources properly. Another potential trigger is the OOM killer terminating our process to reclaim memory.
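To illustrate the mechanism (this is only a minimal sketch of how fanotify permission events behave, not Endpoint's actual implementation): a process that opens a fanotify group in permission mode and then never answers the events will stall every open() on the marked mount, and if that process dies without the kernel releasing the group cleanly, the waiters can stay stuck, which is exactly the hung-task pattern. It requires root and will genuinely block file access on the marked mount while it runs, so only try it on a throwaway VM.

```c
/* fanotify_block_demo.c -- minimal sketch of how an unanswered fanotify
 * permission event stalls IO. NOT Endpoint code; just an illustration.
 * Build: gcc -o fanotify_block_demo fanotify_block_demo.c
 * Run as root, ONLY on a throwaway VM: ./fanotify_block_demo /mnt/test
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/fanotify.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <mount point>\n", argv[0]);
        return EXIT_FAILURE;
    }

    /* FAN_CLASS_CONTENT lets us request permission events: every open()
     * on the marked mount now waits until we reply FAN_ALLOW/FAN_DENY. */
    int fan_fd = fanotify_init(FAN_CLASS_CONTENT, O_RDONLY);
    if (fan_fd == -1) { perror("fanotify_init"); return EXIT_FAILURE; }

    if (fanotify_mark(fan_fd, FAN_MARK_ADD | FAN_MARK_MOUNT,
                      FAN_OPEN_PERM, AT_FDCWD, argv[1]) == -1) {
        perror("fanotify_mark");
        return EXIT_FAILURE;
    }

    /* Read events but deliberately never write a fanotify_response.
     * Any process opening a file on the marked mount now sleeps in the
     * kernel, which shows up as "hung task" warnings. When this process
     * exits normally, the kernel releases the group and the waiters are
     * allowed through; the failure mode discussed above is when that
     * release does not happen cleanly (crash / SIGKILL / OOM kill). */
    for (;;) {
        char buf[4096];
        ssize_t len = read(fan_fd, buf, sizeof(buf));
        if (len <= 0)
            break;

        struct fanotify_event_metadata *meta =
            (struct fanotify_event_metadata *) buf;
        for (; FAN_EVENT_OK(meta, len); meta = FAN_EVENT_NEXT(meta, len)) {
            printf("unanswered FAN_OPEN_PERM from pid %d\n", (int) meta->pid);
            if (meta->fd >= 0)
                close(meta->fd);   /* closing the event fd is not a reply */
        }
    }
    return EXIT_SUCCESS;
}
```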
Around kernel version 5.1, an improvement to the fanotify subsystem was committed that should alleviate the issue, but we haven't been able to fully confirm it (it's quite tough to reproduce).
We also improved our own fanotify handling to decrease the probability of (or perhaps eliminate) the issue occurring on pre-5.1 kernels. This improvement landed in 8.8, and we backported it to 7.17 as well. If you are able to upgrade to a newer version of the Elastic Stack, we would recommend trying it.
Are any of the machines running SELinux? We suspected before that some clash with certain SELinux operations might be causing the fanotify hang to occur.
If possible, please share more of the related logs, especially the task timeouts you've seen or any other unusual errors on the kernel side, so that we can do a better root cause analysis. Endpoint logs could also be useful; you can find them in the /opt/Elastic/Endpoint/state/log directory.
I will DM you a private upload link for the logs.
If upgrading the stack is not possible at the moment, you could also try disabling Malware Protection and ensuring that the advanced policy options related to fanotify (linux.advanced.fanotify.*) are empty.
Here's a table of the kernel versions and SELinux states:
| host | role | OS | SELinux | kernel version | notes |
|---|---|---|---|---|---|
| system A | KVM host | Ubuntu 18.04 | no | 4.15.0-76 | Livepatch via Ksplice is active on this one. I have uninstalled Elastic here for the moment; I can reinstall. The impact is low since the VMs/containers stay unimpacted, but I'll have to cut power whenever I need to connect to the host again. |
| system B | VM, Zimbra mail | OEL8 | disabled | 5.4.17-2136 (UEK) | This was supposed to run the stock EL kernel, but the boot default was wrong. It also runs OpenVPN to a frontend NAT gateway with a public IP. |
Once I understand which logs you need, I'll re-enable the agent on the KVM host so you can have regular, uh... irregularities.
With some effort (in the 2-10h range) I can isolate at least the KVM host and give you full access, if you have the resources to make good use of it.
Generally I can hold back on updating to get you more data, and I'll make a plan to deploy 8.8 for the non-profit where I really, really want to get ES into action.
Thank you for the data, this will be very helpful. Can you confirm that the lockup occurred on the host with the 5.4 kernel? That would imply the kernel patch in 5.1 does not fully fix the issue.
I think full access is not necessary; it should be enough to retrieve logs after the issue has occurred. I understand the system might already be locked up by then, so extracting logs would not be easy. Previously, when we saw this, we ended up reattaching the drive to another machine to back up the logs and inspect them.
I think dmesg, the systemd journal (journalctl), and the Endpoint logs (and maybe the Elastic Agent logs) would be enough; you can also include any other logs that you think could be useful or that show unexpected, possibly related errors. It would also be interesting to know whether Endpoint is still running after the lockup has happened. In one case where this issue occurred, it was due to a SIGSEGV that Endpoint had triggered, which we found evidence of in audit.log, so you could look for such traces in the audit logs too.
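One more idea, purely as a sketch (the helper below is my own suggestion, not something Endpoint ships): since starting new processes may itself hang once the lockup is in full swing, you could keep a small pre-compiled tool around that lists which processes still hold fanotify descriptors, by scanning /proc/<pid>/fdinfo for "fanotify" lines. Running it as root early on, when the first hung-task messages appear, would tell us whether Endpoint still owns the fanotify group at that point.

```c
/* fanotify_holders.c -- hypothetical diagnostic: list processes that hold
 * a fanotify descriptor by scanning /proc/<pid>/fdinfo (run as root).
 * Build ahead of time: gcc -o fanotify_holders fanotify_holders.c
 */
#include <ctype.h>
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

static int numeric(const char *s)
{
    for (; *s; s++)
        if (!isdigit((unsigned char) *s))
            return 0;
    return 1;
}

int main(void)
{
    DIR *proc = opendir("/proc");
    if (!proc) { perror("/proc"); return 1; }

    struct dirent *p;
    while ((p = readdir(proc)) != NULL) {
        if (!numeric(p->d_name))
            continue;                         /* not a pid directory */

        char dir[PATH_MAX];
        snprintf(dir, sizeof(dir), "/proc/%s/fdinfo", p->d_name);
        DIR *fdinfo = opendir(dir);
        if (!fdinfo)
            continue;                         /* process gone or no access */

        struct dirent *f;
        while ((f = readdir(fdinfo)) != NULL) {
            if (!numeric(f->d_name))
                continue;

            char path[PATH_MAX], line[256];
            snprintf(path, sizeof(path), "%s/%s", dir, f->d_name);
            FILE *fp = fopen(path, "r");
            if (!fp)
                continue;

            /* fanotify fds expose "fanotify ..." lines in their fdinfo */
            int found = 0;
            while (fgets(line, sizeof(line), fp))
                if (strstr(line, "fanotify")) { found = 1; break; }
            fclose(fp);

            if (found) {
                char comm_path[PATH_MAX], comm[64] = "?";
                snprintf(comm_path, sizeof(comm_path),
                         "/proc/%s/comm", p->d_name);
                FILE *c = fopen(comm_path, "r");
                if (c && fgets(comm, sizeof(comm), c))
                    comm[strcspn(comm, "\n")] = '\0';
                if (c)
                    fclose(c);
                printf("pid %s (%s): fanotify fd %s\n",
                       p->d_name, comm, f->d_name);
            }
        }
        closedir(fdinfo);
    }
    closedir(proc);
    return 0;
}
```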
The screenshot was from system B (so the 5.4 kernel; I'll check if/when that kernel was last installed).
I'll think it over; I think I can try to enable netconsole so we get a coherent look at what's going on.
I'll ping back once it gets stuck again (it's my mail server, so at some point it will always be noticed).
Edit: good news, the kernel was likely installed on 2023-06-10, and the last reboot due to a hang was 5 days ago.