Elastic Agent TLS Auth Handshake Failed - Internal Error

Hi all,

I wonder if anyone can offer advice or pointers on troubleshooting an issue we have with a number of clients managed by Fleet.

We're running 7.13.0 and have just started deploying to Windows endpoints (~500 of a possible ~4000 so far). During the initial deployment we had Windows Perfmon/Service metrics being exported, but they turned out to be hugely chatty, so after a few days we disabled them; the focus is on Windows log ingest, not performance metrics. We also disabled the ingest of System Instance metrics.

The agents are (mainly) deployed via an SCCM package, and have Endpoint Security, System, and Windows integrations enabled.

Now, here's the issue. Most agents were deployed between Monday 27th September and Thursday 30th September. In Fleet, we disabled Perfmon/Service metrics on Friday at 0945 and System Metrics on Friday at 1400. Within a few minutes, the whole stack started to receive hundreds of thousands of events from endpoints with the following message -

[elastic_agent.metricbeat][error] elastic-agent-client got error: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: remote error: tls: internal error"
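To put a number on how many of these each component is producing, something like the following could be run over exported log lines. This is only a minimal sketch: it assumes the lines are available as plain text in the same "[component][level] message" shape as the one quoted above, and the names `LINE_RE`, `TLS_MARKER` and `count_tls_errors` are made up for illustration.

```python
import re
from collections import Counter

# Assumed line shape: "[component][level] message", as in the line quoted above.
LINE_RE = re.compile(
    r"^\[(?P<component>[^\]]+)\]\[(?P<level>[^\]]+)\]\s+(?P<message>.*)$"
)
# Substring taken from the error message quoted in this post.
TLS_MARKER = "tls: internal error"


def count_tls_errors(lines):
    """Count error-level lines containing the TLS marker, keyed by component."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match["level"] == "error" and TLS_MARKER in match["message"]:
            counts[match["component"]] += 1
    return counts


sample = [
    '[elastic_agent.metricbeat][error] elastic-agent-client got error: '
    'rpc error: code = Unavailable desc = connection error: desc = '
    '"transport: authentication handshake failed: remote error: tls: internal error"',
    "[elastic_agent][info] Elastic Agent status changed to: 'online'",
]
print(count_tls_errors(sample))
```

Running the same count per host (one file per endpoint, for example) would show which machines are affected without waiting for them to be spotted by hand.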

The sheer number of events can be observed below -

The 'fix' for affected machines has been to restart them once they've been identified as experiencing the TLS internal error.

So far, we haven't been able to narrow down exactly why some devices are affected and others are not. It's approximately 1 in 6 of the ~500 deployed.

Is anyone able to suggest the best place to investigate this further? At the moment, we're at a loss to explain why disabling certain Integration features would cause such an unexpected result.

Best regards
Andy

Hi all,

Just to add a bit more context to this, if it helps. We've noticed two things.

On some of the Data Streams, the 'Last Activity' time is way off into the future (!) -

We're at a loss to explain why this would be, as other (active) Streams have a correct Last Activity time.

The following is also seen in the endpoint data for Metricbeat -

[elastic_agent][error] 2021-10-05T11:55:58+01:00 - message: Application: metricbeat--7.13.0[0f143740-6b2f-4ec1-b215-0c1e862fd9de]: State changed to CRASHED: exited with code: 1 - type: 'ERROR' - sub_type: 'FAILED'
[elastic_agent][info] Elastic Agent status changed to: 'online'
[elastic_agent][info] 2021-10-05T11:55:58+01:00 - message: Application: metricbeat--7.13.0[0f143740-6b2f-4ec1-b215-0c1e862fd9de]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
[elastic_agent][info] 2021-10-05T11:55:58+01:00 - message: Application: metricbeat--7.13.0[0f143740-6b2f-4ec1-b215-0c1e862fd9de]: State changed to RESTARTING: Restarting - type: 'STATE' - sub_type: 'STARTING'

Following that, the TLS errors start to appear, in huge numbers.
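To check whether the TLS errors always follow a Metricbeat crash, the state changes can be pulled out of the agent log and lined up by timestamp. Again, this is a rough sketch that assumes the exact line format quoted above; `STATE_RE` and `state_changes` are hypothetical names, not anything provided by Elastic Agent.

```python
import re
from datetime import datetime

# Assumed shape, copied from the state-change lines quoted above:
# [component][level] <ISO timestamp> - message: Application: <app>[<id>]: State changed to <STATE>: ...
STATE_RE = re.compile(
    r"^\[[^\]]+\]\[[^\]]+\]\s+(?P<ts>\S+) - message: Application: "
    r"(?P<app>[^\[]+)\[(?P<id>[^\]]+)\]: State changed to (?P<state>[A-Z]+):"
)


def state_changes(lines):
    """Return (timestamp, application, state) tuples for state-change lines."""
    changes = []
    for line in lines:
        match = STATE_RE.match(line.strip())
        if match:
            # fromisoformat accepts the "+01:00" offset used in these logs.
            ts = datetime.fromisoformat(match["ts"])
            changes.append((ts, match["app"], match["state"]))
    return changes


sample = [
    "[elastic_agent][error] 2021-10-05T11:55:58+01:00 - message: Application: "
    "metricbeat--7.13.0[0f143740-6b2f-4ec1-b215-0c1e862fd9de]: State changed to "
    "CRASHED: exited with code: 1 - type: 'ERROR' - sub_type: 'FAILED'",
    "[elastic_agent][info] Elastic Agent status changed to: 'online'",
    "[elastic_agent][info] 2021-10-05T11:55:58+01:00 - message: Application: "
    "metricbeat--7.13.0[0f143740-6b2f-4ec1-b215-0c1e862fd9de]: State changed to "
    "STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'",
]

crashes = [c for c in state_changes(sample) if c[2] == "CRASHED"]
for ts, app, state in crashes:
    print(ts.isoformat(), app, state)
```

Comparing the crash timestamps against the first TLS error on each host should show whether the crash always comes first, or whether some hosts hit the TLS error without one.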

Could this be a bug in the way Metricbeat/Elastic Agent handle errors in 7.13.0?

Best regards
Andy