We are doing a PoC with the Elastic Agent, and one of our agent hosts became UNHEALTHY after an upgrade.

We have the following ingestion flow:

Elastic Agent -> HAProxy (passthrough) -> Logstash -> Elasticsearch

We currently have three policies: one for Linux workstations, one for Linux servers, and one for Windows workstations. It is this last one that is not working correctly.

I requested the file for this agent and, looking at the endpoint service log, it says that it cannot connect to the Logstash server, which does not make much sense because no change was made to the network.

The error is not helpful at all:

{"@timestamp":"2023-10-06T14:47:30.6465521Z","agent":{"id":"03ef0b8d-2d54-4d72-94a7-70189dae65d0","type":"endpoint"},"ecs":{"version":"1.11.0"},"log":{"level":"error","origin":{"file":{"line":662,"name":"LogstashClient.cpp"}}},"message":"LogstashClient.cpp:662 SSL handshake with Logstash server at HAPROXY-IP:5046 encountered an error: (null)","process":{"pid":5172,"thread":{"id":7088}}}

It is complaining about the SSL handshake with the Logstash server, but the error is (null), so I am not sure what is happening.

This started after we upgraded the Agent from the Fleet UI.

This same ingestion flow works for all the Linux machines; the only difference between the policies is the Logstash port.

In the Endpoint screen in Kibana, it says that the Windows agent has an out-of-date policy, so I'm assuming something didn't work as expected during the upgrade.

How should I approach troubleshooting this?

Why don't you ingest the data directly into Elasticsearch instead of sending it to Logstash first? Are you using a self-signed certificate? You can try setting the option that disables certificate validation in Elastic Agent. Another thing to check, on the Fleet Server, is whether any integration in your policy is incompatible.

We need to use Logstash; only the Logstash servers are allowed to connect to the Elasticsearch servers, so that is not the issue.

Everything worked fine before; the issue only affects a single Agent, the only one on Windows, and only after the upgrade to version 8.10.2.

Since we have a license, we opened a ticket with Elastic. It looks like a conflict with our VPN application, as the problem is intermittent.

Another possibility is to use Wireshark to analyze the traffic and try to understand how this communication behaves. When you run the telnet <logstash-ip> <port> command, is the connection closed normally?
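A telnet check only shows that the TCP port is reachable; it does not exercise the TLS handshake that the log line complains about. As a rough complement, here is a minimal Python sketch that separates the two stages and reports which one fails. The function name `probe_tls` and all of its parameters are hypothetical and only for illustration. It assumes Python 3 is available on the affected host, and that you pass the same CA file (and client certificate, if your Logstash input requires one) that the agent's Logstash output is configured with. It is not how the Endpoint client itself connects.

```python
import socket
import ssl


def probe_tls(host, port, timeout=5.0, cafile=None,
              certfile=None, keyfile=None, verify=True):
    """Try a TCP connect, then a TLS handshake, and report where it fails.

    All parameter names are illustrative. `cafile`, `certfile` and `keyfile`
    are optional paths to PEM files.
    """
    result = {"tcp": False, "tls": False, "version": None, "error": None}

    # Stage 1: plain TCP connect, which is all that telnet checks.
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        result["error"] = f"TCP connect failed: {exc!r}"
        return result
    result["tcp"] = True

    # Stage 2: TLS handshake over the connection that was just opened.
    context = ssl.create_default_context(cafile=cafile)
    if not verify:
        # Diagnostic only: skip certificate and hostname checks.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    if certfile:
        context.load_cert_chain(certfile, keyfile)
    try:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            result["tls"] = True
            result["version"] = tls.version()
    except OSError as exc:  # ssl.SSLError and timeouts are OSError subclasses
        result["error"] = f"TLS handshake failed: {exc!r}"
    finally:
        sock.close()
    return result


# Example call (hypothetical host name and CA path):
# print(probe_tls("haproxy.example.internal", 5046, cafile="ca.pem"))
```

Running it several times with the VPN client connected and disconnected may show whether the handshake, rather than the TCP connect, is the stage that fails intermittently.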

It is not a connection issue: the connection works, telnet works, and the certificate works. Only one agent, the one running Windows, has this issue after the upgrade.

It is intermittent, and we are investigating a conflict with our VPN client; the agent seems to have some network-related issue.
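For an intermittent failure like this, it can help to see when the handshake errors cluster so they can be compared against VPN connect and disconnect times. The following is a small sketch that counts them per minute. It assumes the endpoint log is newline-delimited JSON with the same fields as the line quoted earlier (`@timestamp`, `log.level`, `message`); the function name and the log file name in the usage comment are hypothetical.

```python
import json
from collections import Counter


def handshake_errors_per_minute(lines):
    """Count error-level 'SSL handshake' entries per minute.

    `lines` is any iterable of NDJSON log lines. The minute bucket is taken
    from the first 16 characters of @timestamp, e.g. '2023-10-06T14:47'.
    """
    buckets = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict):
            continue
        level = entry.get("log", {}).get("level")
        message = entry.get("message", "")
        if level == "error" and "SSL handshake" in message:
            buckets[entry.get("@timestamp", "")[:16]] += 1
    return buckets


# Example usage (hypothetical file name):
# with open("endpoint.log", encoding="utf-8") as fh:
#     for minute, count in sorted(handshake_errors_per_minute(fh).items()):
#         print(minute, count)
```

Slicing the timestamp string avoids parsing the seven-digit fractional seconds, which some Python versions do not accept in `datetime.fromisoformat`.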

Since I already opened a ticket, I will mark this as concluded.

Thanks anyway @wsouza !

