Following up on a few machines that dropped offline again today and then came back online after several hours. Several of the agents were connected to the last host that went down for updates, but not all of them, since I have 4 nodes they can connect to. Only some of the agents have stopped. The only pattern I can see is that the 2020-09 updates were just applied to all of the offline machines, but other machines also have 2020-09 and they still work...
I started looking at the last logs in Kibana as the machines dropped offline; today they stopped sending logs altogether, some as recently as a few minutes ago. This was not expected...
This is the default config out of the box, with no changes to the system yet. Each of the offline nodes had high memory usage.
The Metricbeat bundled with Endpoint still runs, along with Filebeat, after Elastic Agent and Elastic Endpoint are stopped. I had already disabled the standalone Metricbeat on the endpoints in question for testing, just to rule it out beforehand.
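For what it's worth, this is roughly the check I run on an affected node to see which Elastic processes are still alive and how much memory they use (Python with the third-party psutil package; the process name matching is just my assumption for a default Windows install, not anything official):

# Rough check of which Elastic processes are still running and their
# memory use on a node. Needs the third-party psutil package; the
# process names below are assumptions for a default Windows install.
import psutil

WATCH = ("elastic-agent", "elastic-endpoint", "metricbeat", "filebeat")

for proc in psutil.process_iter(["name", "memory_info"]):
    name = (proc.info["name"] or "").lower()
    if any(w in name for w in WATCH):
        mem = proc.info["memory_info"]
        if mem is None:
            continue  # access denied for this process, skip it
        rss_mb = mem.rss / (1024 * 1024)
        print(f"{name}: pid={proc.pid} rss={rss_mb:.1f} MB")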
One of the last logs in the Ingest Manager:
"malware": {
"concerned_actions": [
"agent_connectivity",
"load_config",
"workflow",
"download_global_artifacts",
"download_user_artifacts",
"configure_malware",
"read_malware_config",
"load_malware_model",
"read_kernel_config",
"configure_kernel",
"detect_process_events",
"detect_file_write_events",
"connect_kernel",
"detect_file_open_events",
"detect_sync_image_load_events"
],
"status": "failure"
},
"streaming": {
"concerned_actions": [
"agent_connectivity",
"load_config",
"read_elasticsearch_config",
"configure_elasticsearch_connection",
"workflow"
],
"status": "success"
}
}
},
"status": "failure"
From the Windows Application event log:
Faulting application name: elastic-endpoint.exe, version: 7.9.0.0, time stamp: 0x5f32bdd7
Faulting module name: elastic-endpoint.exe, version: 7.9.0.0, time stamp: 0x5f32bdd7
Exception code: 0xc0000005
From the local endpoint-xxx.log on the same machine the above logs and snippet are from:
{"@timestamp":"2020-09-10T23:22:32.95811900Z","agent":{"id":"removed","type":"endpoint"},"ecs":{"version":"1.5.0"},"log":{"level":"info","origin":{"file":{"line":1392,"name":"HttpLib.cpp"}}},"message":"HttpLib.cpp:1392 Establishing GET connection to [https://node3:9200/_cluster/health]","process":{"pid":5496,"thread":{"id":2140}}}
{"@timestamp":"2020-09-10T23:22:32.95811900Z","agent":{"id":"71cfd898-0cf9-47c5-a97d-bb8f3f3b1f9a","type":"endpoint"},"ecs":{"version":"1.5.0"},"log":{"level":"notice","origin":{"file":{"line":65,"name":"BulkQueueConsumer.cpp"}}},"message":"BulkQueueConsumer.cpp:65 Elasticsearch connection is down","process":{"pid":5496,"thread":{"id":2140}}}
If you want more detailed logs, just tell me what you need. I will PM them unredacted, as I have a decent test bed to pull from. If you want a guinea pig for an alpha or beta, I'll do that as well.