We have converted an agent policy to use the Logstash ingest path instead of sending directly to the cluster. Events are being received by Logstash and they appear in Elastic, so the end-to-end path is working OK.
However, the agent status remains unhealthy, and the following errors are appearing in the agent log:
[elastic_agent][error] 2022-12-21T23:41:26Z - message: Application: endpoint-security--8.5.2[e41402d7-e5af-4e48-8b58-ab8bff40d216]: State changed to FAILED: failed to start connection credentials listener: listen tcp 127.0.0.1:6788: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted. - type: 'ERROR' - sub_type: 'FAILED'
[elastic_agent][error] failed to dispatch actions, error: operator: failed to execute step sc-run, error: failed to start connection credentials listener: listen tcp 127.0.0.1:6788: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.: failed to start connection credentials listener: listen tcp 127.0.0.1:6788: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.
Anyone have any idea what is going on here? The agent has been stopped and started, and the host has even been rebooted.
The host is Windows Server 2022, and the agent version is 8.5.2.
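For anyone who wants to check what is actually sitting on 127.0.0.1:6788, plain PowerShell on the host will show it (nothing Elastic-specific here):

# Show anything bound to the port from the error message
Get-NetTCPConnection -LocalPort 6788 -ErrorAction SilentlyContinue |
    Select-Object LocalAddress, LocalPort, State, OwningProcess
# Map the owning PID(s) back to process names
Get-NetTCPConnection -LocalPort 6788 -ErrorAction SilentlyContinue |
    ForEach-Object { Get-Process -Id $_.OwningProcess }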
Some more info...
If I change the agent policy to the original one without the Logstash output, the error does not appear and the agent status is healthy. Changing back to the agent policy with the Logstash output causes the error. Logic says the error has to have something to do with the Logstash piece, but the error message doesn't hint at that, and the normal agent events are getting to the Elastic cluster via the Logstash server - and the event count is going up on the Logstash server.
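(In case it's useful to anyone, one way to watch that count on the Logstash side is the node stats API - this assumes the default API port 9600, and the hostname is a placeholder:)

# Logstash node stats; the events.in / events.out counters should keep climbing
Invoke-RestMethod -Uri "http://your-logstash-host:9600/_node/stats/events"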
Curiouser and curiouser.
Running elastic-agent status gives this interesting line of output:
* endpoint-security (FAILED)
failed to start connection credentials listener: listen tcp 127.0.0.1:6788: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.
So it appears that the endpoint-security integration does not work well when a Logstash output is enabled, but it is fine with the direct Elasticsearch output:
* endpoint-security (HEALTHY)
Protecting with policy {3ed69483-78c0-47c0-bc11-9b713301ffbd}
Endpoint Security supports Logstash output (since 8.3, I think). I suspect the issue is that you have two default outputs set, one for Logstash and one for Elasticsearch. When the Logstash output is removed, everything is healthy because there is only one output, so the "which output should be used" problem doesn't arise. If you remove the Elasticsearch output and leave just Logstash, the problem should go away and you should see Endpoint Security events appear in Elasticsearch (an easy search for them is event.module : endpoint).
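For example, a quick query along these lines should show Endpoint documents arriving (the host, credentials and the logs-* index pattern are placeholders - adjust for your cluster):

# Pull one sample Endpoint document from the logs-* data streams
$body = '{ "query": { "match": { "event.module": "endpoint" } }, "size": 1 }'
Invoke-RestMethod -Method Post -Uri "https://your-elasticsearch-host:9200/logs-*/_search" `
    -ContentType "application/json" -Body $body -Credential (Get-Credential)
# On PowerShell 7+ add -SkipCertificateCheck if the cluster uses a self-signed certificate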
I think this is a bug; I have a support case open for this issue. I think that when you switch outputs in the Fleet UI, the Elastic Agent tries to spin up a new instance of Endpoint before taking the old one down. This causes a loop where Elastic Endpoint is never able to properly update to the correct output, because Endpoint requires a static socket that isn't freed up by the old Endpoint instance.
The workaround I've found is:
Remove the Endpoint integration from the agent policy
Wait for the change to roll out to the agents (this stops the "old" Endpoint Security process/instance - see the check after this list)
Switch the policy to the Logstash output
Re-add the Endpoint integration to the agent policy
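Between steps 2 and 3, it's worth confirming the old Endpoint instance really is gone before you switch outputs - roughly like this (the "ElasticEndpoint" service and elastic-endpoint process names are my assumptions for a default Windows install; adjust if yours differ):

# After the rollout, the Endpoint process should be gone
# (and the service either removed or stopped)
Get-Service -Name "ElasticEndpoint" -ErrorAction SilentlyContinue
Get-Process -Name "elastic-endpoint" -ErrorAction SilentlyContinue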
Bingo!! This is exactly what I had to do last night. More debugging showed me that the agent was trying to start a second instance of the Endpoint integration, which was what was generating the original error message. Strange that this only happened when the output was Logstash instead of the default.
My agents got terribly confused, so I had to:
switch the policy back to the default output, and wait for it to take effect
remove the Endpoint integration from the policy, and wait for it to take effect
-- but the Endpoint software was still running; it would not shut down
reboot the machine the agent was running on
-- this got rid of the Endpoint software
switch the policy back to the Logstash output, and wait for it to take effect
add the Endpoint integration back to the policy, and wait for it to take effect
There seems to be an underlying issue where the agent doesn't handle the stop/start of the Endpoint integration properly.
Thanks for the help - it's good to know that I wasn't going mad!!