We have converted an agent policy to use the Logstash ingest path instead of sending directly to the cluster. Events are being received by Logstash and they appear in Elastic, so the end-to-end path is working OK.
However, the agent status remains unhealthy, and the following errors are appearing in the agent log:
[elastic_agent][error] 2022-12-21T23:41:26Z - message: Application: endpoint-security--8.5.2[e41402d7-e5af-4e48-8b58-ab8bff40d216]: State changed to FAILED: failed to start connection credentials listener: listen tcp 127.0.0.1:6788: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted. - type: 'ERROR' - sub_type: 'FAILED'
[elastic_agent][error] failed to dispatch actions, error: operator: failed to execute step sc-run, error: failed to start connection credentials listener: listen tcp 127.0.0.1:6788: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.: failed to start connection credentials listener: listen tcp 127.0.0.1:6788: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.
Anyone have any idea what is going on here? The agent has been stopped and started, and the host has even been rebooted.
The host is Windows Server 2022, and the agent version is 8.5.2.
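For anyone who wants to check what is actually sitting on 127.0.0.1:6788, plain PowerShell on the host will show it (nothing Elastic-specific here):

# Show anything bound to the port from the error message
Get-NetTCPConnection -LocalPort 6788 -ErrorAction SilentlyContinue |
    Select-Object LocalAddress, LocalPort, State, OwningProcess
# Map the owning PID(s) back to process names
Get-NetTCPConnection -LocalPort 6788 -ErrorAction SilentlyContinue |
    ForEach-Object { Get-Process -Id $_.OwningProcess }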
Some more info...
If I change the agent policy to the original one without the Logstash output, the error does not appear and the agent status is healthy. Changing back to the agent policy with the Logstash output causes the error. Logic says the error has to have something to do with the Logstash piece, but the error message doesn't hint at that, and the normal agent events are getting to the Elastic cluster via the Logstash server - and the event count is going up on the Logstash server.
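(In case it's useful to anyone, one way to watch that count on the Logstash side is the node stats API - this assumes the default API port 9600, and the hostname is a placeholder:)

# Logstash node stats; the events.in / events.out counters should keep climbing
Invoke-RestMethod -Uri "http://your-logstash-host:9600/_node/stats/events"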
Curiouser and curiouser.
Running elastic-agent status gives this interesting line of output:
* endpoint-security (FAILED)
failed to start connection credentials listener: listen tcp 127.0.0.1:6788: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.
So it appears that the endpoint-security integration does not work well when a Logstash output is enabled, but it is fine with the direct Elasticsearch output:
* endpoint-security (HEALTHY)
Protecting with policy {3ed69483-78c0-47c0-bc11-9b713301ffbd}
Endpoint Security supports Logstash output (since 8.3, I think). I suspect the issue is that you have two default outputs set, one for Logstash and one for Elasticsearch. When the Logstash output is removed, everything is healthy because there is only one output, so the "which output should be used" problem doesn't arise. If you remove the Elasticsearch output and leave just Logstash, the problem should go away and you should see Endpoint Security events appear in Elasticsearch (an easy search for them is event.module : endpoint).
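For example, a quick query along these lines should show Endpoint documents arriving (the host, credentials and the logs-* index pattern are placeholders - adjust for your cluster):

# Pull one sample Endpoint document from the logs-* data streams
$body = '{ "query": { "match": { "event.module": "endpoint" } }, "size": 1 }'
Invoke-RestMethod -Method Post -Uri "https://your-elasticsearch-host:9200/logs-*/_search" `
    -ContentType "application/json" -Body $body -Credential (Get-Credential)
# On PowerShell 7+ add -SkipCertificateCheck if the cluster uses a self-signed certificate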
I think this is a bug; I have a support case open for this issue. I think that when you switch outputs in the Fleet UI, the Elastic Agent tries to spin up a new instance of Endpoint before taking the old one down. This causes a loop where Elastic Endpoint is never able to properly update to the correct output, because Endpoint requires a static socket that isn't freed up by the old Endpoint instance.
The workaround I've found is:
Remove the Endpoint integration from the agent policy
Wait for the change to roll out to the agents (this stops the "old" Endpoint Security process/instance - see the check after this list)
Switch the policy to the Logstash output
Re-add the Endpoint integration to the agent policy
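Between steps 2 and 3, it's worth confirming the old Endpoint instance really is gone before you switch outputs - roughly like this (the "ElasticEndpoint" service and elastic-endpoint process names are my assumptions for a default Windows install; adjust if yours differ):

# After the rollout, the Endpoint process should be gone
# (and the service either removed or stopped)
Get-Service -Name "ElasticEndpoint" -ErrorAction SilentlyContinue
Get-Process -Name "elastic-endpoint" -ErrorAction SilentlyContinue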
Bingo!! This is exactly what I had to do last night. More debugging showed me that the agent was trying to start a second instance of the Endpoint integration, which was what was generating the original error message. Strange that this only happened when the output was Logstash instead of the default.
My agents got terribly confused, so I had to:
switch the policy back to the default output, and wait for it to take effect
remove the Endpoint integration from the policy, and wait for it to take effect
-- but the Endpoint software was still running; it would not shut down
reboot the machine the agent was running on
-- this got rid of the Endpoint software
switch the policy back to the Logstash output, and wait for it to take effect
add the Endpoint integration back to the policy, and wait for it to take effect
There seems to be an underlying issue where the agent doesn't handle the stop/start of the Endpoint integration properly.
Thanks for the help - it's good to know that I wasn't going mad!!