Linux agent system hang / disk IO stall

Hi,

I'm seeing an issue with two systems that stop functioning after a while when the agent is active.
I've watched it happen for 2-3 months now and have gone through some tests, like disabling prevention for one of them.
I'm on 8.7.0; I think it started around 8.6, after a few months of running 8.5 without issues.

The behaviour is something like this:

  • console shows hung task timeouts related to Defend or other kernel components
  • at first I can still log in and the system is usable
  • the system stops processing most IO-related tasks
  • I can still log in, but can't run processes
  • after a certain point, logging in is no longer possible
  • KVM VMs are still functioning at this point
  • Docker containers are still functioning at this point

As far as I understand it, the IO is blocked on the kernel side; there's no disk issue or anything like that. Since processes hang on IO, at some point the system reaches overload conditions.

I've verified that this happens with the Elastic Agent installed and enrolled, and does not happen when it's uninstalled.

  • system A is an Ubuntu KVM host; that one has kernel live patching
  • system B is an OEL8 Zimbra mail server with nothing special on it (I think)
  • system A and system B both have a Synology backup agent (installed long before this started)
  • I cannot see any corresponding alerts in Kibana

Still trying to get an IPMI snapshot of system A.
On system B I can usually see errors like those in the attached screenshot.

I'm trying to find a way to avoid these issues; one of the worst parts is that it happens even in detect mode.
FYI, I have only about 15 systems to test with, so the scale is too small to reveal patterns.

Hi! What kernel versions are the affected machines running?

Based on your description this sounds a lot like a rare issue we've observed related to fanotify, the kernel feature we use for file events. It can cause an unrecoverable system lockup under some rare conditions, for example if we happen to crash while holding a fanotify descriptor, or are SIGKILLed (kill -9) without being able to clean up system resources properly. Another potential trigger is the OOM killer killing our process to reclaim memory.
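For context, here's a minimal sketch of how a fanotify permission listener works (just an illustration I put together, not our actual implementation). With a FAN_OPEN_PERM mark, every open() on the marked mount blocks inside the kernel until the listener writes back a FAN_ALLOW or FAN_DENY response, which is why a listener that crashes or is killed while events are outstanding can leave IO hanging:

```c
/* Minimal fanotify permission listener -- illustration only, not Endpoint code.
 * Needs root (CAP_SYS_ADMIN) and a kernel built with fanotify support. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/fanotify.h>
#include <unistd.h>

int main(void)
{
    /* FAN_CLASS_CONTENT is required to receive permission events. */
    int fan_fd = fanotify_init(FAN_CLASS_CONTENT, O_RDONLY);
    if (fan_fd < 0) {
        perror("fanotify_init");
        return 1;
    }

    /* Ask for a permission event on every file open on the mount containing "/". */
    if (fanotify_mark(fan_fd, FAN_MARK_ADD | FAN_MARK_MOUNT,
                      FAN_OPEN_PERM, AT_FDCWD, "/") < 0) {
        perror("fanotify_mark");
        return 1;
    }

    for (;;) {
        struct fanotify_event_metadata buf[64];
        ssize_t len = read(fan_fd, buf, sizeof(buf));
        if (len <= 0)
            break;

        for (struct fanotify_event_metadata *ev = buf;
             FAN_EVENT_OK(ev, len);
             ev = FAN_EVENT_NEXT(ev, len)) {
            if (ev->mask & FAN_OPEN_PERM) {
                /* The process doing the open() stays blocked in the kernel
                 * until this response is written back. */
                struct fanotify_response resp = {
                    .fd = ev->fd,
                    .response = FAN_ALLOW  /* or FAN_DENY after scanning */
                };
                if (write(fan_fd, &resp, sizeof(resp)) < 0)
                    perror("write response");
            }
            if (ev->fd >= 0)
                close(ev->fd);  /* every event carries an open fd to the file */
        }
    }
    close(fan_fd);
    return 0;
}
```

On the kernels where this problem shows up, if a process like this dies without its fanotify group being cleaned up properly, the blocked opens are left with nobody to answer them, which matches the hung task symptoms you're describing.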

Around kernel version 5.1, an improvement to the fanotify subsystem was committed that should alleviate the issue, but we haven't been able to fully confirm it (it's quite tough to reproduce).

We also improved our fanotify handling to decrease the probability of (or perhaps eliminate) the issue occurring on pre-5.1 kernels. This improvement landed in 8.8 and we backported it to 7.17 as well. If you are able to upgrade to a newer version of the Elastic Stack, we would recommend trying that.

Are any of the machines running SELinux? We suspected before that some clash with certain SELinux operations might be causing the fanotify hang to occur.

If it's possible, please share more of the related logs, especially the task timeouts you've seen or any other unusual kernel-side errors, so that we can do a better root-cause analysis. Endpoint logs could also be useful; you can find them in the /opt/Elastic/Endpoint/state/log directory.

I will DM you a private upload link for the logs.

If upgrading the stack is not possible at the moment, you could also try disabling Malware Protection and ensuring advanced policy options related to fanotify (linux.advanced.fanotify.*) are empty.


Here's a table of the kernel versions and SELinux states:

| host | role | os | selinux | kernel version | notes |
| --- | --- | --- | --- | --- | --- |
| system A | KVM host | Ubuntu 18.04 | no | 4.15.0-76 | livepatch via Ksplice is active on this one; I have uninstalled Elastic here for the moment and can reinstall. The impact is low since the VMs/containers stay unaffected, but I'll have to power-cycle it whenever I need to connect to the host again :confused: |
| system B | VM, Zimbra mail | OEL8 | disabled | 5.4.17-2136 (UEK) | this was supposed to run the stock EL kernel but the boot default was wrong; also runs OpenVPN to a frontend NAT gateway with a public IP |

Once I have an understanding of the logs you need, I'll re-enable the agent on the KVM host so you can have regular, uh... irregularities.

With some effort (in the 2-10 hour range) I can isolate at least the KVM host to give you full access, if you have the resources to make good use of it.

Generally I can hold back on updating to get you more data, and I'll make a plan to deploy 8.8 for the non-profit where I really, really want to get ES into action.


Thank you for the data, this will be very helpful. Can you confirm that the lockup occurred on a host with a 5.4 kernel? That would imply the kernel patch in 5.1 does not fully fix the issue.

I think full access is not necessary; it should be enough to retrieve logs after the issue has occurred. I understand the system might already be locked up, so extracting logs would not be easy. When we previously saw this, we ended up reattaching the drive to another machine to back up the logs and inspect them.

I think dmesg, the systemd journal (journalctl), plus the Endpoint logs (and maybe the Elastic Agent logs) would be enough; you can also include other logs that you think could be useful or that show unexpected, possibly related errors. It would be interesting to know whether Endpoint is still running after the lockup has happened. In one case where we saw this issue occur, it was due to a SIGSEGV that Endpoint had triggered, which we saw evidence of in audit.log, so you could look in the audit system logs for such traces too.

Hi,

The screenshot was from system B (so the 5.4 kernel; I'll check when the kernel was last installed).

I'll think it over; I think I can try to enable netconsole so we get a coherent look at what's going on.
I'll ping back once it gets stuck again (it's my mail server, so at some point it will always be noticed :slight_smile:).

edit: good news, the kernel was likely installed on 2023-06-10; the last reboot due to a hang was 5 days ago.
