Orphaned agent is healthy in 9.0.1

Self-hosted installation of ELK 9.0.1 with ~30 agents running:

On a Windows Server 2022 host we deployed an elastic-agent that reports everything is fine (elastic-agent status: Fleet connected and agent healthy).

In Kibana, however, Fleet shows this agent as "orphaned" with the last check-in message: Running.
The logs of this agent show these two "errors":

*[elastic_agent][error] 2025-05-22 11:52:18: info: InstallLib.cpp:668 Installed endpoint is expected version (version: 9.0.1, compiled: Tue Apr 29 21:00:00 2025, branch: HEAD, commit: 36be778dc95d8f92217aed26425759e415111a22)*

*[elastic_agent][error] 2025-05-22 11:52:18: info: Util.cpp:2244 Endpoint Service is running.*

Should I report this directly on GitHub, or is there a known issue/workaround for this?

Those two errors are actually related to Endpoint and not to the problem you're experiencing. I'm not sure we should be reporting those two lines as errors, though, so I'm going to look at that.

I'll see if I can find someone who knows more about the orphaned status to respond, though.

edit: I responded too quickly. Endpoint logged those lines correctly as info; the agent reported them as errors.

Thanks for letting us know. We've recently run into a similar issue on one test setup, where it happened after a stack upgrade.

Was your stack recently upgraded to 9.0?

A little explanation

The "Orphaned" status comes from an audit written by an orphaned Endpoint. The stack communicates with Elastic Endpoint via Elastic Agent. If the Agent stops working, Endpoint sends an "orphaned" audit to clearly differentiate this from the Offline state; otherwise such an Agent would just appear offline.
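
If you want to see which agents currently carry this audit marker, here is a rough Dev Tools sketch (just an example, not an official procedure; .fleet-agents is a hidden system index, so treat it as read-only here):

GET .fleet-agents/_search
{
  // assumes the audit field written by Endpoint is mapped as a keyword
  "query": {
    "term": {
      "audit_unenrolled_reason": "orphaned"
    }
  },
  "_source": ["agent.id", "local_metadata.host.hostname", "audit_unenrolled_time"]
}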

We will continue to look for the root cause internally.

In the meantime I'd recommend checking the Endpoint service status.

If everything appears fine on the Agent and Endpoint service side, then the only issue is resetting the audit, which is what we suspect is the case.

Thanks for your input and explanations - yes, this stack was upgraded from 8.18.1 to 9.0.1.

For the Endpoint service status (everything appears fine..):

  1. Output of elastic-endpoint.exe status (JSON output can be provided if needed):
     - elastic-agent
       - status: (HEALTHY) Connected
     - elastic-endpoint
       - status: (HEALTHY) Running

  2. Screenshot of the Agent in Kibana->Fleet:

  3. elastic-endpoint test output reports all 3 connections with "Success"

Which option would you recommend:

  • move the agent to a temporary policy without endpoint (and then back)?
  • shall we re-install the Agent?
  • is there a way to "reset the audit" for Endpoint ourselves?
  • wait for the devs to figure it out & for a new version to be available?

It's not very convenient to fix the state. Do you have many endpoints/agents affected?

You can reset the audit for the affected Agent, but it requires a document update. The agent's document in the .fleet-agents index contains the unenrolled reason/time fields which are causing the issue. However, to remove those fields a document update has to be made; as you know, Elasticsearch doesn't have a query syntax to just delete or alter a single field of a document.

          "audit_unenrolled_reason": "orphaned",
          "audit_unenrolled_time": "2025-05-26T19:24:48Z",

I've been in touch with the team that will deliver the fix. The issue is under investigation. One corner case causing this has already been found.


Currently we have three affected agents. Thank you very much for your answer.

I will not touch the .fleet-agents index and will wait patiently for a fixed version - after all, the issue looks only cosmetic to me; functionality is not impacted.


Is there any schedule for a fix?

Since I found no convenient or supported way to get elastic-agents out of “stuck” or “erroneously displayed” states in kibana→fleet→agents, I did it inconveniently and perhaps unsupportedly this way (thanks @lesio for pointing me in this direction):

Disclaimer: don’t try this on your production ELK.. I guess..

  1. Get yourself some privileges on an internal, hidden system-index:

  2. Discover this index - e.g.:

  3. Filter for specific Agents - e.g. via “local_metadata.host.hostname” (a Dev Tools equivalent is sketched after this list):

  4. Delete all ancient, antique, old or not-recent documents from the index (e.g. all docs except the last one in the screenshot above..)

  5. Fix the current document - painlessly (but highly discouraged..) until it looks equivalent to the agent’s real state (which you should check locally - we often see already-upgraded agents on local systems that are displayed with a lower version in Kibana, resisting every upgrade attempt via Fleet..)
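
For step 3, a Dev Tools equivalent of that Discover filter would look roughly like this (the hostname is just a placeholder):

GET .fleet-agents/_search
{
  "query": {
    "match": {
      // placeholder: hostname of the affected Windows host
      "local_metadata.host.hostname": "my-affected-host"
    }
  }
}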

e.g., for step 5: clear the “audit_unenrolled_reason”:

POST .fleet-agents/_update_by_query
{
  "query": {
    "term": {
      "agent.id": "2x7x44f4-7954-4478-xxxx-2c07xx907f17"
    }
  },
  "script": {
    "source": "ctx._source.audit_unenrolled_reason = null;",
    "lang": "painless"
  }
}

I cleared the “orphaned” string last - after resetting all possibly incorrect date-fields (using painless like above..)
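
For the date field that reset looked analogous (again unsupported, same redacted agent id as above; sketched here only for completeness):

POST .fleet-agents/_update_by_query
{
  "query": {
    "term": {
      "agent.id": "2x7x44f4-7954-4478-xxxx-2c07xx907f17"
    }
  },
  "script": {
    // sets the audit timestamp back to null, like the reason field above
    "source": "ctx._source.audit_unenrolled_time = null;",
    "lang": "painless"
  }
}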

btw: I hope we get a convenient and supported way to fix this when the future major-version comes along:

#! this request accesses system indices: [.fleet-agents-7], but in a future major version, direct access to system indices will be prevented by default

Unfortunately this topic has been marked as solved, but the issue still exists?

I do not think this topic can be ignored just because a workaround exists.

Is there any recent info from the dev team about how to solve this without having to edit data in a system index?

I agree, I'm also experiencing the same issue.