Issue between Fleet-managed Elastic agent on external server and Fleet in k8s

Hi,

My Fleet-managed agent can't send logs to ES (they are always dropped), but it does send metrics. The error message seen in the agent's status is a 504 Gateway Time-out.

Additionally, we see this message in the agent's log (in debug mode only):

action [indices:data/write/bulk[s]] is unauthorized for API key id [API_KEY_ID] of user [elastic/fleet-server] on indices [logs-system-syslog], this action is granted by the index privileges [create_doc,create,delete,index,write,all]

This shows that the API key generated by Fleet has insufficient rights to push the logs to ES.
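For anyone debugging a similar privilege error, the privileges actually attached to the offending API key can be inspected directly with the Elasticsearch get-API-key API. A sketch, assuming a superuser and with ES_URL, the credentials, and API_KEY_ID as placeholders to replace:

```shell
# Fetch the API key's metadata, including its role descriptors,
# to see which index privileges it really carries.
# ES_URL, elastic:changeme and API_KEY_ID are placeholders.
curl -s -u elastic:changeme \
  "ES_URL/_security/api_key?id=API_KEY_ID" | jq '.api_keys[0].role_descriptors'
```

If the returned role descriptors do list `create_doc` on the target data streams, the error is misleading and the problem is more likely upstream (as it turned out to be here).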

This behavior appears only when the agent is managed by Fleet. When it runs in standalone mode, there is no problem: all logs (and metrics) are sent without error. However, we would like to use Fleet to manage our agents.

Stack Description

I have an ELK stack in a Kubernetes cluster. The whole stack is on 8.13.2 (same for the agent and Fleet). The Fleet Server runs in a pod, and our agents run on bare-metal servers outside of the K8s cluster.

Fleet has its own policy and the agents have their own policy. The Fleet policy has a Fleet Server integration with the default parameters.

The agent has a System integration installed with a slightly modified configuration, which works.

More detailed description

My agent is installed on a Debian server and I followed the Elastic documentation to install it (Install standalone Elastic Agent). I enrolled my agent with the following command:

sudo elastic-agent enroll --url=FLEET_URL --enrollment-token=AGENT_POLICY_ENROLLMENT_TOKEN --force
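When a 504 shows up later, it can also help to confirm that the Fleet Server is reachable from the agent host through any proxies in between. A sketch, reusing the FLEET_URL placeholder from above:

```shell
# Fleet Server exposes a status endpoint; an HTML 504 page here
# would point at a proxy/ingress in front of it rather than at
# Fleet Server itself.
curl -s "FLEET_URL/api/status"
```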

For 2-3 minutes after enrollment, the agent is connected to Fleet and healthy; sometimes it takes a long time to connect, and Fleet is in a STARTING state (instead of connected) before failing. After that, Fleet becomes inaccessible. The agent nevertheless appears correctly in the Agents section of Fleet management in Kibana, and its state switches between Healthy and Offline.

Here is the status of my agent after a few minutes:

$ sudo elastic-agent status

┌─ fleet
│ └─ status: (FAILED) could not decode the response, raw response: <html><body><h1>504 Gateway Time-out</h1>
│ The server didn't respond in time.
│ </body></html>
│
│
└─ elastic-agent
└─ status: (HEALTHY) Running

During and after those few minutes, the logs show a lot of dropped events. After enabling debug mode on my agent (via Kibana, then restarting the agent), I see the following error when it drops the events:

action [indices:data/write/bulk[s]] is unauthorized for API key id [API_KEY_ID] of user [elastic/fleet-server] on indices [logs-system-syslog], this action is granted by the index privileges [create_doc,create,delete,index,write,all]

I am having the same issue on Elasticsearch 8.16.1. I am using ECK, same scenario: Fleet Server in Kubernetes while the agents are bare-metal servers, with HAProxy in the middle. Did you manage to solve this?

Thanks

I have found a fix: in my case it wasn't the external HAProxy load balancer but rather the Kubernetes ingress controller (HAProxy) that was timing out.
I read on GitHub that Fleet uses long-lived connections, so I had to increase the ingress timeouts to keep it stable.

    ingress.kubernetes.io/timeout-client: 24d
    ingress.kubernetes.io/timeout-connect: 24d
    ingress.kubernetes.io/timeout-http-request: 24d
    ingress.kubernetes.io/timeout-server: 24d
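The annotations above are specific to the HAProxy ingress controller. For anyone hitting the same symptom behind ingress-nginx instead, the equivalent knobs would be the proxy timeout annotations (a sketch, untested in my setup; values are in seconds):

```yaml
# ingress-nginx equivalents of the HAProxy timeouts above.
# 3600 s is an example value; pick something comfortably longer
# than Fleet's long-lived connection checkins.
nginx.ingress.kubernetes.io/proxy-connect-timeout: "3600"
nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
```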