Network Disruption on Kubernetes Node with Elastic Security Integration on Debian

Hello,

I am experiencing an issue where the private Kubernetes network fails on a node after setting up the Elastic Security integration with the Elastic Agent directly on a Debian server, which runs alongside another Elastic Agent operating within my k8s cluster. The issue arises specifically when the Elastic Agent is installed directly on the Debian server, and it affects network functionality on the corresponding k8s node.

Environment Details:

  • Elastic Stack Version: 8.12.1
  • Operating System: Debian 12
  • Kernel Version: 6.1.0-18
  • Kubernetes Version: 1.27.8
  • Cilium Version: 1.14.5
  • Deployment: Kubernetes with Fleet Server for agent management
  • Observation: One agent is running within the Kubernetes cluster with only the Kubernetes integration and functions without issues. The problem occurs when another Elastic Agent with the Elastic Security integration is installed on the Debian server.

Symptom:

  • After installing the Elastic Agent with Elastic Security on the Debian server, the k8s private network on the affected node stops functioning correctly. The temporary workaround to restore network functionality is to restart the Cilium pod, but this fix is only temporary, as the network issues reoccur within minutes to hours.
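For reference, the restart workaround above can be done with a single command. This is a sketch, assuming Cilium is deployed as a DaemonSet in the kube-system namespace with its standard k8s-app=cilium label; NODE_NAME is a placeholder for the affected node:

```shell
# Delete the Cilium agent pod on the affected node; the DaemonSet
# controller recreates it, which temporarily restores the network.
kubectl -n kube-system delete pod \
  -l k8s-app=cilium \
  --field-selector spec.nodeName=NODE_NAME
```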

Steps to Reproduce:

  1. Ensure an Elastic Agent with Kubernetes integration is running within a k8s cluster.
  2. Install another Elastic Agent with the Elastic Security integration on a Debian server, changing the gRPC port to avoid conflicts with the Kubernetes agent.
  3. Observe the disruption in the k8s private network functionality on the node associated with the Debian server.
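As a note on step 2, the port change for a standalone agent would live in its local configuration. A minimal sketch of what that could look like in elastic-agent.yml; the agent.grpc block is an assumption about where this is set (it may vary by Agent version), and 6790 is only an example value:

```yaml
# elastic-agent.yml (standalone Debian agent) — assumed setting;
# pick a port that does not collide with the in-cluster agent.
agent.grpc:
  port: 6790
```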

Actual Behavior: The k8s private network on the node associated with the Debian server fails, causing significant operational issues. The only remedy found is to restart the Cilium pod, which provides only a short-term fix, as the network issues recur after some time.

Additional Context:

  • The issue does not occur with the Kubernetes agent running the Kubernetes integration.
  • The problem seems to be triggered specifically by the Elastic Security integration; the other integrations are trouble-free and work properly.
  • Unfortunately, I don't have any interesting logs to share: despite the private network no longer working, the integration itself keeps working properly.

I am seeking assistance to resolve this network disruption issue, which seems to be tied to the specific setup of Elastic Agent with Elastic Security on a Debian server parallel to a Kubernetes environment. Any insights, suggestions, or solutions to prevent the k8s network from failing would be greatly appreciated.

Thank you.

Thanks for all those details!

It might be an interaction between Endpoint and Cilium that's causing this. Can you try disabling Endpoint's support for host isolation? On systems where host isolation is possible, Endpoint loads some eBPF probes at startup so it is ready to isolate the host.

To disable that, go to the appropriate Elastic Defend policy, click "Show advanced settings" at the bottom of the page, then set linux.advanced.host_isolation.allowed to false. While that setting should take effect as soon as the policy is saved and applied to the host, you might need to remove the Elastic Defend integration, verify networking works again, then re-add the integration and see if networking now continues to work.
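For clarity, the advanced setting is entered as a plain key/value pair at the bottom of the Defend policy page; what you end up setting is:

```
linux.advanced.host_isolation.allowed: false
```

Once the policy is saved, Fleet pushes it to the enrolled agents, so no change is needed on the Debian host itself.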

I hope that fixes this issue. If it does, the only thing that will be disabled is host isolation; protections and network events will still work.


Thanks for the answer, it works perfectly

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.