Unhealthy status yet sending events - agent via Logstash

We have converted an agent policy to use the Logstash ingest path instead of sending directly to the cluster. Events are being received by Logstash and they appear in Elastic, so the end-to-end path is working OK.
However, the agent status remains unhealthy, and the following errors are appearing in the agent log:

[elastic_agent][error] 2022-12-21T23:41:26Z - message: Application: endpoint-security--8.5.2[e41402d7-e5af-4e48-8b58-ab8bff40d216]: State changed to FAILED: failed to start connection credentials listener: listen tcp 127.0.0.1:6788: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted. - type: 'ERROR' - sub_type: 'FAILED'

[elastic_agent][error] failed to dispatch actions, error: operator: failed to execute step sc-run, error: failed to start connection credentials listener: listen tcp 127.0.0.1:6788: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.: failed to start connection credentials listener: listen tcp 127.0.0.1:6788: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.
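
For anyone else hitting this: the bind error means something is already listening on 127.0.0.1:6788. A quick way to see which process holds the port, using standard built-in Windows commands (the PID in the second command is a placeholder for whatever the first one reports):

  rem Show the owning PID for anything bound to port 6788
  netstat -ano | findstr :6788

  rem Look up that PID
  tasklist /FI "PID eq 1234"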

Anyone have any idea what is going on here? The agent has been stopped and started, and I have even rebooted the host.
The host is Windows 2022 and the agent version is 8.5.2.
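
For reference, the change amounts to the policy's default output pointing at Logstash instead of Elasticsearch. The rendered policy the agent actually received can be checked on the host with elastic-agent inspect; the relevant section looks roughly like this (the host name and CA path below are placeholders, not our real values):

  outputs:
    default:
      type: logstash
      hosts:
        - "logstash.example.internal:5044"
      ssl.certificate_authorities:
        - 'C:\certs\ca.crt'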

Ross

Some more info...
If I change the agent policy back to the original one without the Logstash output, the error does not appear and the agent status is healthy. Switching back to the policy with the Logstash output brings the error back. Logic says the error has to have something to do with the Logstash piece, but the error message doesn't hint at that, and the normal agent events are getting to the Elastic cluster via the Logstash server - the event count on the Logstash server keeps going up.
Curiouser and curiouser.
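
One easy way to watch that event count, for anyone reproducing this, is Logstash's node stats API; it listens on port 9600 by default and binds to localhost, so it needs to be queried from the Logstash server itself. The in/filtered/out counters it returns should keep climbing while events are flowing:

  curl http://localhost:9600/_node/stats/events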

Another update

Running elastic-agent status gives this interesting line of output:

* endpoint-security      (FAILED)
                          failed to start connection credentials listener: listen tcp 127.0.0.1:6788: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.

So it appears that the endpoint-security integration does not work well when a Logstash output is enabled, but is fine with the direct output:

  * endpoint-security      (HEALTHY)
                           Protecting with policy {3ed69483-78c0-47c0-bc11-9b713301ffbd}

Endpoint Security does support the Logstash output (since 8.3, I think). I suspect the issue is that you have two default outputs set, one for Logstash and one for Elasticsearch. When the Logstash output is removed, everything is healthy because there is only one output, so the "which output should be used" problem doesn't arise. If you remove the Elasticsearch output and leave just Logstash, the problem should go away and you should see Endpoint Security events appear in Elasticsearch (an easy search for them is event.module : endpoint).
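
If it's easier to check from Dev Tools than from Discover, something like this should show whether any Endpoint documents are arriving (logs-* is just a deliberately broad index pattern; narrow it down if you know where your Endpoint data streams land):

  GET logs-*/_search
  {
    "size": 1,
    "sort": [ { "@timestamp": "desc" } ],
    "query": { "term": { "event.module": "endpoint" } }
  }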

I hope that helps.

I think this is a bug; I have a support case open for this issue. I think that when you switch outputs in the Fleet UI, the Elastic Agent tries to spin up a new instance of Endpoint before taking the old one down. This causes a loop where Elastic Endpoint is never able to properly update to the correct output, because Endpoint requires a static socket that isn't freed up by the old Endpoint instance.
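
For the support case it's worth attaching an agent diagnostics bundle captured while the problem is happening; on 8.5 the command is, as far as I remember, the one below (newer releases shorten it to just elastic-agent diagnostics):

  elastic-agent diagnostics collect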

The workaround I've found is:

  1. Remove Endpoint Integration from Agent policy
  2. Wait for the change to roll out to Agents (this will stop the "old" Endpoint security process/instance - see the check sketched after this list)
  3. Switch to Logstash output
  4. Re-add the Endpoint Integration to the Agent policy
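
Between steps 2 and 3 it's worth confirming the old Endpoint instance really is gone before switching the output. On Windows, assuming the process is named elastic-endpoint.exe (which is what it's called on the hosts I've seen), no output from either of these means you're clear:

  tasklist | findstr /i "elastic-endpoint"
  netstat -ano | findstr :6788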

Bingo!! This is exactly what I had to do last night. More debugging showed me that the agent was trying to start a second instance of the Endpoint integration, which was what was generating the original error message. Strange that this only happens when the output is Logstash instead of the default.
My agents got terribly confused, so I had to:

  • switch the policy back to the default output, and wait for it to take effect
  • remove the Endpoint integration from the policy and wait for it to take effect
    -- but the Endpoint software was still running; it would not shut down
  • reboot the machine the agent was running on
    -- this got rid of the Endpoint software
  • switch the policy back to using the Logstash output, and wait for it to take effect
  • add the Endpoint integration back to the policy, and wait for it to take effect

There seems to be an underlying issue where the agent doesn't handle the stop/start of the Endpoint integration properly.

Thanks for the help - it's good to know that I wasn't going mad!!

Ross


FYI, Elastic opened a formal issue for this bug: Some policy updates can cause duplicate Endpoint processes · Issue #2008 · elastic/elastic-agent · GitHub
