Elastic Agent 7.14, on-prem Elastic cluster. Affected OS: Windows Server 2016/2019; both have the same issue.
The error below was pulled directly from a failed enrollment attempt on the client.
Fleet agent in prod is running on CentOS Stream. Test Fleet server was Ubuntu.
Steps used:
- Remove 7.13.4
- Install windows updates.
- Reboot server.
- Delete the existing Agent folder from C:\Program Files\Elastic\Agent
- Enroll agent from the 7.14 download.
- Start chanting no whammies, no whammies, no whammies.
- If a whammy occurs, go to Fleet and unenroll the failed agent. If you just re-run the enrollment instead, you will end up with duplicate agents and the agent will be stuck in the Updating state forever.
- Restart the Fleet Server agent, then re-enroll (a scripted version of these steps is sketched below).
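For anyone who wants to script those steps, here is a rough Python sketch of the uninstall/clean/enroll part. The Fleet URL, enrollment token, and extract path are placeholders, not values from this environment; Windows updates and the reboot are still manual, and `--insecure` is only there because this setup already runs with TLS verification disabled.

```python
# Rough sketch of scripting the steps above (Windows updates/reboot still done by hand).
# FLEET_URL, ENROLLMENT_TOKEN, and NEW_AGENT are placeholders, not values from this post.
import shutil
import subprocess
from pathlib import Path

AGENT_DIR = Path(r"C:\Program Files\Elastic\Agent")
NEW_AGENT = Path(r"C:\Temp\elastic-agent-7.14.0-windows-x86_64\elastic-agent.exe")  # assumed extract location
FLEET_URL = "https://fleet-server.example.local:8220"  # placeholder
ENROLLMENT_TOKEN = "<enrollment token>"                # placeholder

# Remove the 7.13.4 install (the uninstall subcommand ships with the installed agent).
subprocess.run([str(AGENT_DIR / "elastic-agent.exe"), "uninstall", "--force"], check=False)

# Delete any leftover Agent folder so 7.14 starts clean.
shutil.rmtree(AGENT_DIR, ignore_errors=True)

# Install and enroll from the 7.14 download. --insecure only because this
# environment already runs with TLS verification disabled (see the log below).
subprocess.run(
    [str(NEW_AGENT), "install", "--force",
     "--url", FLEET_URL,
     "--enrollment-token", ENROLLMENT_TOKEN,
     "--insecure"],
    check=True,
)
```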
```
2021-08-16T11:46:18.366-0700 ERROR cmd/watch.go:61 failed to load marker open C:\Program Files\Elastic\Agent\data\.update-marker: The system cannot find the file specified.
2021-08-16T11:46:18.591-0700 INFO [composable.providers.docker] docker/docker.go:43 Docker provider skipped, unable to connect: protocol not available
2021-08-16T11:46:18.593-0700 INFO [api] api/server.go:62 Starting stats endpoint
2021-08-16T11:46:18.594-0700 INFO application/managed_mode.go:291 Agent is starting
2021-08-16T11:46:18.594-0700 INFO [api] api/server.go:64 Metrics endpoint listening on: \\.\pipe\elastic-agent (configured: npipe:///elastic-agent)
2021-08-16T11:46:18.694-0700 WARN application/managed_mode.go:304 failed to ack update open C:\Program Files\Elastic\Agent\data\.update-marker: The system cannot find the file specified.
2021-08-16T11:46:19.069-0700 WARN [tls] tlscommon/tls_config.go:98 SSL/TLS verifications disabled.
2021-08-16T11:46:19.330-0700 ERROR fleet/fleet_gateway.go:205 Could not communicate with fleet-server Checking API will retry, error: status code: 400, fleet-server returned an error: BadRequest
```
This happens when the Fleet Server agent hits 600+ MB of memory usage. Restart the service and you are good to go; the error will not show up again until the Fleet Server agent climbs back over 600 MB.
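Until there is a real fix, something like the following watchdog on the Fleet Server host could automate that restart. It is only a sketch and assumes psutil, an elastic-agent systemd unit on the CentOS Stream host, and the rough 600 MB threshold from above:

```python
# Workaround watchdog for the Fleet Server host (CentOS Stream here), not a fix:
# restart the elastic-agent service once its resident memory passes ~600 MB.
# The process name, systemd unit name, and threshold are assumptions about this setup.
import subprocess
import time

import psutil  # third-party: pip install psutil

THRESHOLD_BYTES = 600 * 1024 * 1024   # the ~600 MB point where the stall shows up
SYSTEMD_UNIT = "elastic-agent"        # assumed unit name
CHECK_INTERVAL_SECONDS = 60


def fleet_agent_rss_bytes() -> int:
    """Sum resident memory of all elastic-agent processes on this host."""
    total = 0
    for proc in psutil.process_iter(["name", "memory_info"]):
        name = proc.info["name"] or ""
        mem = proc.info["memory_info"]
        if name.startswith("elastic-agent") and mem is not None:
            total += mem.rss
    return total


while True:
    if fleet_agent_rss_bytes() > THRESHOLD_BYTES:
        subprocess.run(["systemctl", "restart", SYSTEMD_UNIT], check=False)
    time.sleep(CHECK_INTERVAL_SECONDS)
```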
After the Fleet Server agent restarts, this shows up in the Kibana GUI:
circuit_breaking_exception: [circuit_breaking_exception] Reason: [in_flight_requests] Data too large, data for [<http_request>] would be [8875977956/8.2gb], which is larger than the limit of [8589934592/8gb]
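For anyone else hitting that breaker, here is a quick sketch that reads the in_flight_requests breaker from the Elasticsearch nodes stats API so you can watch it climb toward the limit. The URL and credentials are placeholders, and verify=False is only there because TLS verification is already disabled in this environment:

```python
# Quick check of the in_flight_requests breaker on each Elasticsearch node.
# ES_URL and AUTH are placeholders for this environment.
import requests  # third-party: pip install requests

ES_URL = "https://elasticsearch.example.local:9200"  # placeholder
AUTH = ("elastic", "<password>")                     # placeholder

resp = requests.get(f"{ES_URL}/_nodes/stats/breaker", auth=AUTH, verify=False)
resp.raise_for_status()

for node_id, node in resp.json()["nodes"].items():
    breaker = node["breakers"]["in_flight_requests"]
    print(
        f"{node.get('name', node_id)}: "
        f"{breaker['estimated_size']} used of {breaker['limit_size']} limit, "
        f"tripped {breaker['tripped']} times"
    )
```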
This has also been causing another issue I posted about: excessive network sessions. I was finally able to track that down to the same issue above. When the Fleet Server agent stalls out, the clients or the Fleet Server agent itself start opening thousands of network sessions. The average is about 17,000 sessions per agent, which works out to roughly 1 GB of traffic per agent. It's fun seeing whether your network is built to handle massive session counts, but it's not good for production since you can saturate uplinks.
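If you want to confirm the session counts on your own hosts, a small psutil sketch like this tallies established TCP sessions per remote address (run it on an affected client or on the Fleet Server host):

```python
# Tally established TCP sessions per remote address to spot a host holding
# tens of thousands of open sessions (may need admin/root to see everything).
from collections import Counter

import psutil  # third-party: pip install psutil

counts = Counter()
for conn in psutil.net_connections(kind="tcp"):
    if conn.status == psutil.CONN_ESTABLISHED and conn.raddr:
        counts[conn.raddr.ip] += 1

for remote_ip, total in counts.most_common(10):
    print(f"{remote_ip}: {total} established sessions")
```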