Elastic Agent TLS Auth Handshake Failed - Internal Error

millap · October 4, 2021, 12:41pm

Hi all,

I wonder if anyone is able to offer some advice or pointers on troubleshooting an issue we have with a number of clients managed by Fleet.

We're running 7.13.0 and have just started deploying to Windows endpoints (~500 of a possible ~4000 so far). During the initial deployment we exported Windows Perfmon/Service metrics, but they turned out to be hugely chatty, so after a few days we disabled them; our focus is Windows log ingest, not performance metrics. We also disabled the ingest of System Instance metrics.

The agents are (mainly) deployed via an SCCM package, and have Endpoint Security, System, and Windows integrations enabled.

Now, here's the issue. Most agents were deployed between Monday 27th September and Thursday 30th September. In Fleet, we disabled Perfmon/Service metrics on Friday at 0945 and System Metrics on Friday at 1400. Within a few minutes, the whole stack started to receive hundreds of thousands of events from endpoints with the following message:

[elastic_agent.metricbeat][error] elastic-agent-client got error: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: remote error: tls: internal error"

The sheer number of events can be observed below:

[Screenshot of event volume not included]

The 'fix' has been to restart each affected machine once it is identified as experiencing the TLS internal error.

So far, we haven't been able to narrow down exactly 'why' some devices are affected and others are not. Approximately 1 in 6 of the ~500 deployed agents have hit the error.
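In case it helps anyone trying to narrow down which endpoints are affected, here is a rough Python sketch that counts the TLS error per host. It assumes the events have already been exported as (host name, message) pairs; that export format and the function names are made up for illustration and are not part of Elastic Agent or Fleet.

```python
from collections import Counter
from typing import Iterable, Tuple

# Substring taken from the error message quoted above.
TLS_ERROR_MARKER = "tls: internal error"


def count_tls_errors_by_host(records: Iterable[Tuple[str, str]]) -> Counter:
    """Count TLS internal-error events per host.

    `records` is a hypothetical export of (host_name, message) pairs;
    adapt it to however the events are actually pulled from the stack.
    """
    counts: Counter = Counter()
    for host, message in records:
        if TLS_ERROR_MARKER in message:
            counts[host] += 1
    return counts


def affected_ratio(counts: Counter, total_hosts: int) -> float:
    """Fraction of deployed hosts that logged the error at least once."""
    return len(counts) / total_hosts if total_hosts else 0.0
```

Sorting the result with `counts.most_common()` lists the noisiest machines first, which may help decide which ones to restart or inspect before the others.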

Is anyone able to suggest the best place to investigate this further? At the moment, we're at a loss to explain why disabling certain Integration features would cause such an unexpected result.

Best regards
Andy

millap · October 6, 2021, 9:35am

Hi all,

Just to add a bit more context to this, if it helps. We've noticed two things.

On some of the Data Streams, the 'Last Activity' time is way off into the future (!):

[Screenshot of Data Streams list not included]

We're at a loss to explain why this would be, as other (active) Streams have a correct Last Activity time.

We also see the following in the endpoint data for Metricbeat:

[elastic_agent][error] 2021-10-05T11:55:58+01:00 - message: Application: metricbeat--7.13.0[0f143740-6b2f-4ec1-b215-0c1e862fd9de]: State changed to CRASHED: exited with code: 1 - type: 'ERROR' - sub_type: 'FAILED'
[elastic_agent][info] Elastic Agent status changed to: 'online'
[elastic_agent][info] 2021-10-05T11:55:58+01:00 - message: Application: metricbeat--7.13.0[0f143740-6b2f-4ec1-b215-0c1e862fd9de]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
[elastic_agent][info] 2021-10-05T11:55:58+01:00 - message: Application: metricbeat--7.13.0[0f143740-6b2f-4ec1-b215-0c1e862fd9de]: State changed to RESTARTING: Restarting - type: 'STATE' - sub_type: 'STARTING'

Following that, the TLS errors start to appear, in huge numbers.
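To line up the Metricbeat crash with the start of the TLS errors, something like the sketch below could pull the state changes out of the agent log. The regular expression is based only on the lines quoted above, so treat the format as an assumption; other agent versions may log differently. The names `StateChange`, `parse_state_changes` and `crash_times` are illustrative.

```python
import re
from datetime import datetime
from typing import Iterable, List, NamedTuple

# Pattern derived from the log lines quoted above (an assumption about the format).
STATE_CHANGE = re.compile(
    r"^\[(?P<component>[^\]]+)\]\[(?P<level>[^\]]+)\]\s+"
    r"(?P<ts>\S+) - message: Application: "
    r"(?P<app>[^\[]+)\[(?P<app_id>[^\]]+)\]: "
    r"State changed to (?P<state>[A-Z]+):"
)


class StateChange(NamedTuple):
    timestamp: datetime
    app: str
    state: str


def parse_state_changes(lines: Iterable[str]) -> List[StateChange]:
    """Return the application state changes found in `lines`, in order."""
    changes = []
    for line in lines:
        match = STATE_CHANGE.match(line.strip())
        if match is None:
            continue  # not a state-change line, e.g. the 'online' status message
        changes.append(
            StateChange(
                timestamp=datetime.fromisoformat(match["ts"]),
                app=match["app"],
                state=match["state"],
            )
        )
    return changes


def crash_times(changes: Iterable[StateChange]) -> List[datetime]:
    """Timestamps at which an application was reported as CRASHED."""
    return [c.timestamp for c in changes if c.state == "CRASHED"]
```

Comparing the output of `crash_times` with the timestamp of the first TLS error on the same host would show whether the crash always comes first.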

Could this be a bug in the way Metricbeat/Elastic Agent handle errors in 7.13.0?

Best regards
Andy

system · November 3, 2021, 11:35am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
