I wonder if anyone's able to offer some advice or pointers on troubleshooting an issue we have with a number of clients managed by Fleet.
We're running 7.13.0, and have just started deploying to Windows endpoints (~500 out of a possible ~4,000 so far). During the initial deployment we had Windows Perfmon/Service metrics being exported, but after a few days they turned out to be hugely chatty, so we disabled them - the focus was on Windows log ingest, not performance metrics. We also disabled the ingest of System instance metrics.
The agents are (mainly) deployed via an SCCM package, and have Endpoint Security, System, and Windows integrations enabled.
Now, here's the issue. Most agents were deployed between Monday 27th September and Thursday 30th September. In Fleet, we disabled Perfmon/Service metrics on Friday at 09:45, and System metrics on Friday at 14:00. Within a few minutes, the whole stack started to receive hundreds of thousands of events from endpoints with the following message -
[elastic_agent.metricbeat][error] elastic-agent-client got error: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: remote error: tls: internal error"
The sheer number of events can be observed below -
The 'fix' for affected machines has been to restart them once they're identified as experiencing the TLS internal error.
So far, we haven't been able to narrow down exactly why some devices hit this error and others didn't - it's roughly 1 in 6 of the ~500 deployed.
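For anyone wanting to do the same triage, this is roughly how we've been identifying the affected hosts: grep exported agent logs for the handshake error and count events per hostname. This is only a sketch - the `<timestamp> <hostname> <message>` line shape and the host names are assumptions, not our exact log format.

```python
# Hypothetical triage sketch: scan exported elastic-agent log lines for the
# TLS internal-error message and count how many events each host produced.
# The "<timestamp> <hostname> <message>" line layout is an assumption.
import re
from collections import Counter

TLS_ERROR = "authentication handshake failed: remote error: tls: internal error"
LINE_RE = re.compile(r"^(?P<ts>\S+)\s+(?P<host>\S+)\s+(?P<msg>.*)$")

def affected_hosts(lines):
    """Return a Counter mapping hostname -> number of TLS internal-error events."""
    hits = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and TLS_ERROR in m.group("msg"):
            hits[m.group("host")] += 1
    return hits

if __name__ == "__main__":
    # Sample lines for illustration only (hostnames are made up).
    sample = [
        '2021-10-01T10:02:11Z WIN-0042 [elastic_agent.metricbeat][error] '
        'elastic-agent-client got error: rpc error: code = Unavailable desc = '
        'connection error: desc = "transport: authentication handshake failed: '
        'remote error: tls: internal error"',
        "2021-10-01T10:02:12Z WIN-0099 [elastic_agent.filebeat][info] harvester started",
    ]
    for host, n in affected_hosts(sample).most_common():
        print(f"{host}: {n} error events")
```

Anything above a handful of hits per host goes on the restart list.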
Is anyone able to suggest the best place to investigate this further? At the moment we're at a loss to explain why disabling certain integration features would trigger this behaviour.