Elastic Endpoint service failed (missed 3 check-ins)

Hello

I have been having issues with Elastic Agent on my Proxmox hosts (Debian 12), in particular with the Endpoint integration, which fails to become Healthy. All other integrations, such as packet capture and system logs, are OK.

A few seconds after the Elastic Agent is restarted, the endpoint fails to register with the agent:

# elastic-agent status
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a failed state
   └─ endpoint-default
      ├─ status: (FAILED) Failed: endpoint service missed 3 check-ins
      ├─ endpoint-default
      │  └─ status: (FAILED) Failed: endpoint service missed 3 check-ins
      └─ endpoint-default-92fa049c-8082-482f-9328-aa425f583e8e
         └─ status: (FAILED) Failed:

Here is the full set of logs from Elastic (newest entries first):

|Timestamp|Message|
|---|---|
|Mar 11, 2025 @ 18:17:12.251|Component state changed endpoint-default (DEGRADED->FAILED): Failed: endpoint service missed 3 check-ins|
|Mar 11, 2025 @ 18:17:12.251|Unit state changed endpoint-default-92fa049c-8082-482f-9328-aa425f583e8e (STARTING->FAILED): Failed: endpoint service missed 3 check-ins|
|Mar 11, 2025 @ 18:17:12.251|Unit state changed endpoint-default (STARTING->FAILED): Failed: endpoint service missed 3 check-ins|
|Mar 11, 2025 @ 18:16:12.250|Component state changed endpoint-default (STARTING->DEGRADED): Degraded: endpoint service missed 1 check-in|
|Mar 11, 2025 @ 18:15:43.146|Unit state changed log-default-logfile-system-e7140f01-940d-4ba2-9bcc-e9f20352aaf1 (STARTING->HEALTHY): Healthy|
|Mar 11, 2025 @ 18:15:43.145|Unit state changed log-default (STARTING->HEALTHY): Healthy|
|Mar 11, 2025 @ 18:15:42.984|Unit state changed osquery-default (STARTING->HEALTHY): Healthy|
|Mar 11, 2025 @ 18:15:42.984|Unit state changed osquery-default-51e20a89-436f-4787-93b9-cdd91a4699b1 (STARTING->HEALTHY): Healthy|
|Mar 11, 2025 @ 18:15:42.749|Unit state changed packet-default-packet-network-c10e94db-070f-41ca-8834-b1531c55ec52 (STARTING->HEALTHY): Healthy|
|Mar 11, 2025 @ 18:15:42.748|Unit state changed packet-default (STARTING->HEALTHY): Healthy|
|Mar 11, 2025 @ 18:15:42.564|control checkin v2 protocol has chunking enabled|
|Mar 11, 2025 @ 18:15:42.449|control checkin v2 protocol has chunking enabled|
|Mar 11, 2025 @ 18:15:42.326|control checkin v2 protocol has chunking enabled|
|Mar 11, 2025 @ 18:15:42.249|2025-03-11 16:15:42: info: InstallLib.cpp:650 Installed endpoint is expected version (version: 8.17.3, compiled: Wed Feb 26 21:00:00 2025, branch: HEAD, commit: e54b5de09796d1b3601f7d5472359c11fafafc67)|
|Mar 11, 2025 @ 18:15:42.249|after check if endpoint service is installed, err: <nil>|
|Mar 11, 2025 @ 18:15:42.148|2025-03-11 16:15:42: debug: ProcFile.cpp:855 Found 1 cgroups for pid(3355999)|
|Mar 11, 2025 @ 18:15:42.148|2025-03-11 16:15:42: debug: ProcFile.cpp:861 cgroup: id=0 type= path=/system.slice/elastic-agent.service|
|Mar 11, 2025 @ 18:15:42.148|2025-03-11 16:15:42: info: MainPosix.cpp:389 Verifying existing installation|
|Mar 11, 2025 @ 18:15:42.148|2025-03-11 16:15:42: info: InstallLib.cpp:610 Running [/opt/Elastic/Endpoint/elastic-endpoint] [version --log stdout]|
|Mar 11, 2025 @ 18:15:42.148|2025-03-11 16:15:42: debug: Exec.cpp:189 ChildMonitor is pid 3356002 and monitoring pids 3355999 and 3356000|
|Mar 11, 2025 @ 18:15:42.144|Creating connection info server for endpoint service, address: unix:///var/lib/elastic-agent/.eaci.sock|
|Mar 11, 2025 @ 18:15:42.144|check if endpoint service is installed|
|Mar 11, 2025 @ 18:15:42.144|Spawned new component endpoint-default: Starting: endpoint service runtime|
|Mar 11, 2025 @ 18:15:42.144|Spawned new unit endpoint-default-92fa049c-8082-482f-9328-aa425f583e8e: Starting: endpoint service runtime|
|Mar 11, 2025 @ 18:15:42.144|Spawned new unit endpoint-default: Starting: endpoint service runtime|
|Mar 11, 2025 @ 18:15:42.143|control checkin v2 protocol has chunking enabled|
|Mar 11, 2025 @ 18:15:42.143|Component state changed log-default (STARTING->HEALTHY): Healthy: communicating with pid '3355981'|
|Mar 11, 2025 @ 18:15:42.042|Spawned new component log-default: Starting: spawned pid '3355981'|
|Mar 11, 2025 @ 18:15:42.042|Spawned new unit log-default-logfile-system-e7140f01-940d-4ba2-9bcc-e9f20352aaf1: Starting: spawned pid '3355981'|
|Mar 11, 2025 @ 18:15:42.042|Spawned new unit log-default: Starting: spawned pid '3355981'|
|Mar 11, 2025 @ 18:15:41.982|control checkin v2 protocol has chunking enabled|
|Mar 11, 2025 @ 18:15:41.982|Component state changed osquery-default (STARTING->HEALTHY): Healthy: communicating with pid '3355964'|
|Mar 11, 2025 @ 18:15:41.913|Spawned new component osquery-default: Starting: spawned pid '3355964'|
|Mar 11, 2025 @ 18:15:41.913|Spawned new unit osquery-default-51e20a89-436f-4787-93b9-cdd91a4699b1: Starting: spawned pid '3355964'|
|Mar 11, 2025 @ 18:15:41.913|Spawned new unit osquery-default: Starting: spawned pid '3355964'|
|Mar 11, 2025 @ 18:15:41.796|Component state changed packet-default (STARTING->HEALTHY): Healthy: communicating with pid '3355947'|
|Mar 11, 2025 @ 18:15:41.746|control checkin v2 protocol has chunking enabled|
|Mar 11, 2025 @ 18:15:41.681|Spawned new component packet-default: Starting: spawned pid '3355947'|
|Mar 11, 2025 @ 18:15:41.681|Spawned new unit packet-default-packet-network-c10e94db-070f-41ca-8834-b1531c55ec52: Starting: spawned pid '3355947'|
|Mar 11, 2025 @ 18:15:41.681|Spawned new unit packet-default: Starting: spawned pid '3355947'|
|Mar 11, 2025 @ 18:15:41.503|Updating running component model|
|Mar 11, 2025 @ 18:15:41.503|SSL/TLS verifications disabled.|
|Mar 11, 2025 @ 18:15:41.493|Starting stats endpoint|
|Mar 11, 2025 @ 18:15:41.493|Metrics endpoint listening on: 127.0.0.1:6791 (configured: http://localhost:6791)|
|Mar 11, 2025 @ 18:15:41.492|Starting monitoring server with cfg &config.MonitoringConfig{Enabled:true, MonitorLogs:true, MonitorMetrics:true, MetricsPeriod:, FailureThreshold:(*uint)(nil), LogMetrics:true, HTTP:(*config.MonitoringHTTPConfig)(0xc001b828a0), Namespace:default, Pprof:(*config.PprofConfig)(nil), MonitorTraces:false, APM:config.APMConfig{Environment:, APIKey:, SecretToken:, Hosts:[]string(nil), GlobalLabels:map[string]string(nil), TLS:config.APMTLS{SkipVerify:false, ServerCertificate:, ServerCA:}, SamplingRate:(*float32)(nil)}, Diagnostics:config.Diagnostics{Uploader:config.Uploader{MaxRetries:10, InitDur:1000000000, MaxDur:600000000000}, Limit:config.Limit{Interval:60000000000, Burst:1}}}|
|Mar 11, 2025 @ 18:15:41.492|creating monitoring API with cfg api.Config{Enabled:true, Host:http://localhost:6791, Port:6791, User:, SecurityDescriptor:, Timeout:5000000000}|
|Mar 11, 2025 @ 18:15:41.491|Source URI changed from https://artifacts.elastic.co/downloads/ to https://artifacts.elastic.co/downloads/|
|Mar 11, 2025 @ 18:15:41.489|Fleet gateway started|
|Mar 11, 2025 @ 18:15:41.484|Setting fallback log level <nil> from policy|
|Mar 11, 2025 @ 18:15:41.477|restoring current policy from disk|
|Mar 11, 2025 @ 18:15:40.867|Docker provider skipped, unable to connect: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?|
|Mar 11, 2025 @ 18:15:40.866|GRPC control socket listening at unix:///var/lib/elastic-agent/elastic-agent.sock|
|Mar 11, 2025 @ 18:15:40.866|Starting grpc control protocol listener on port 6789 with max_message_size 104857600|
|Mar 11, 2025 @ 18:15:40.851|Parsed configuration and determined agent is managed by Fleet|
|Mar 11, 2025 @ 18:15:40.851|SSL/TLS verifications disabled.|
|Mar 11, 2025 @ 18:15:40.842|GRPC comms socket listening at localhost:6789|
|Mar 11, 2025 @ 18:15:40.607|Capabilities file not found in /etc/elastic-agent/capabilities.yml|
|Mar 11, 2025 @ 18:15:40.607|Determined allowed capabilities|
|Mar 11, 2025 @ 18:15:40.607|Loading baseline config from /etc/elastic-agent/elastic-agent.yml|
|Mar 11, 2025 @ 18:15:40.606|Detected available inputs and outputs|
|Mar 11, 2025 @ 18:15:40.595|Gathered system information|
|Mar 11, 2025 @ 18:15:40.589|APM instrumentation disabled|
|Mar 11, 2025 @ 18:15:40.587|agent is not upgradable, not starting watcher|
|Mar 11, 2025 @ 18:15:40.378|Elastic Agent started|
|Mar 11, 2025 @ 18:15:39.931|Fleet gateway stopped|
|Mar 11, 2025 @ 18:15:38.915|reexec shutdown channel triggered|
|Mar 11, 2025 @ 18:15:38.915|failed accept conn info connection: accept unix /var/lib/elastic-agent/.eaci.sock: use of closed network connection|
|Mar 11, 2025 @ 18:15:38.915|Possible transient error during checkin with fleet-server, retrying|
|Mar 11, 2025 @ 18:15:38.915|stopping endpoint service runtime|

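In the rows above, the endpoint component goes DEGRADED roughly 30 seconds after it is spawned and FAILED roughly 90 seconds after. For anyone who wants to pull those offsets out of a similar export, here is a minimal Python sketch. It assumes the `|timestamp|message|` layout shown in the table and an English month abbreviation; the function names `parse_row` and `timeline` are made up for illustration.

```python
from datetime import datetime

# Three endpoint-default rows copied from the table above (newest first).
ROWS = """\
|Mar 11, 2025 @ 18:17:12.251|Component state changed endpoint-default (DEGRADED->FAILED): Failed: endpoint service missed 3 check-ins|
|Mar 11, 2025 @ 18:16:12.250|Component state changed endpoint-default (STARTING->DEGRADED): Degraded: endpoint service missed 1 check-in|
|Mar 11, 2025 @ 18:15:42.144|Spawned new component endpoint-default: Starting: endpoint service runtime|
"""


def parse_row(line):
    """Split one '|timestamp|message|' row into (datetime, message)."""
    # Split on the first inner pipe only, so pipes inside the message survive.
    stamp, message = line.strip().strip("|").split("|", 1)
    return datetime.strptime(stamp, "%b %d, %Y @ %H:%M:%S.%f"), message


def timeline(text):
    """Return (seconds since first event, message) pairs, oldest first."""
    events = sorted(parse_row(line) for line in text.splitlines() if line.strip())
    start = events[0][0]
    return [((when - start).total_seconds(), message) for when, message in events]


for offset, message in timeline(ROWS):
    print(f"+{offset:7.3f}s  {message}")
```
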
Log output from the endpoint

cat /opt/Elastic/Endpoint/state/log/endpoint-000000.log
{"@timestamp":"2025-03-11T15:16:43.632877363Z","agent":{"id":"","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":228,"name":"Logging.cpp"}}},"message":"Logging.cpp:228 Endpoint info: version: 8.17.0, compiled: Wed Dec 4 19:00:00 2024, branch: HEAD, commit: eea523e3a3b39f3a258c17d05b983a723bd86682","process":{"pid":3329644,"thread":{"id":3329644}}}
{"@timestamp":"2025-03-11T15:16:43.632902266Z","agent":{"id":"","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":134,"name":"PolicyConfig.cpp"}}},"message":"PolicyConfig.cpp:134 Registered configuration callback for logging","process":{"pid":3329644,"thread":{"id":3329644}}}
{"@timestamp":"2025-03-11T15:16:43.632911736Z","agent":{"id":"","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":176,"name":"Entry.cpp"}}},"message":"Entry.cpp:176 Loading plugin: documentLogging","process":{"pid":3329644,"thread":{"id":3329644}}}
{"@timestamp":"2025-03-11T15:52:19.154942089Z","agent":{"id":"","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":228,"name":"Logging.cpp"}}},"message":"Logging.cpp:228 Endpoint info: version: 8.17.3, compiled: Wed Feb 26 21:00:00 2025, branch: HEAD, commit: e54b5de09796d1b3601f7d5472359c11fafafc67","process":{"pid":3345597,"thread":{"id":3345597}}}
{"@timestamp":"2025-03-11T15:52:19.154982504Z","agent":{"id":"","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":134,"name":"PolicyConfig.cpp"}}},"message":"PolicyConfig.cpp:134 Registered configuration callback for logging","process":{"pid":3345597,"thread":{"id":3345597}}}
{"@timestamp":"2025-03-11T15:52:19.154990871Z","agent":{"id":"","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":176,"name":"Entry.cpp"}}},"message":"Entry.cpp:176 Loading plugin: documentLogging","process":{"pid":3345597,"thread":{"id":3345597}}}
{"@timestamp":"2025-03-11T16:02:20.231786761Z","agent":{"id":"","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":89,"name":"MainPosix.cpp"}}},"message":"MainPosix.cpp:89 Aborting due to signal","process":{"pid":3345597,"thread":{"id":3345600}}}

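The endpoint log is one JSON object per line, so it is easier to read when grouped by process. The Python sketch below is only an illustration: it assumes every line carries the `@timestamp`, `process.pid` and `message` fields seen above, and `summarize_runs` is a hypothetical helper name. The two sample lines are copied verbatim from the log above; the last record in the pasted excerpt is the "Aborting due to signal" message from pid 3345597.

```python
import json

# Two lines copied verbatim from the endpoint log above (pid 3345597).
SAMPLE = """\
{"@timestamp":"2025-03-11T15:52:19.154942089Z","agent":{"id":"","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":228,"name":"Logging.cpp"}}},"message":"Logging.cpp:228 Endpoint info: version: 8.17.3, compiled: Wed Feb 26 21:00:00 2025, branch: HEAD, commit: e54b5de09796d1b3601f7d5472359c11fafafc67","process":{"pid":3345597,"thread":{"id":3345597}}}
{"@timestamp":"2025-03-11T16:02:20.231786761Z","agent":{"id":"","type":"endpoint"},"ecs":{"version":"8.10.0"},"log":{"level":"info","origin":{"file":{"line":89,"name":"MainPosix.cpp"}}},"message":"MainPosix.cpp:89 Aborting due to signal","process":{"pid":3345597,"thread":{"id":3345600}}}
"""


def summarize_runs(lines):
    """Group records by pid; return {pid: (first_ts, last_ts, last_message)}."""
    runs = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        pid = record["process"]["pid"]
        ts, message = record["@timestamp"], record["message"]
        # Keep the first timestamp seen for this pid, update the rest.
        first_ts = runs[pid][0] if pid in runs else ts
        runs[pid] = (first_ts, ts, message)
    return runs


for pid, (first_ts, last_ts, last_message) in summarize_runs(SAMPLE.splitlines()).items():
    print(pid, first_ts, last_ts, last_message)
```
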
I checked networking because I thought something was blocking the connection between the agent and endpoint services, but I could see the raw packets flowing normally:

# ss -ntulp | grep elastic
tcp   LISTEN 0      4096       127.0.0.1:6789       0.0.0.0:*    users:(("elastic-agent",pid=3344921,fd=12))
tcp   LISTEN 0      4096       127.0.0.1:6791       0.0.0.0:*    users:(("elastic-agent",pid=3344921,fd=13))

# tcpdump -i any -nn port 6789 or port 6791
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
17:21:29.588552 lo    In  IP 127.0.0.1.51694 > 127.0.0.1.6791: Flags [.], ack 4192996267, win 512, options [nop,nop,TS val 3041978560 ecr 3041963200], length 0
17:21:29.588556 lo    In  IP 127.0.0.1.6791 > 127.0.0.1.51694: Flags [.], ack 1, win 512, options [nop,nop,TS val 3041978560 ecr 3041963200], length 0
17:21:29.588595 lo    In  IP 127.0.0.1.6791 > 127.0.0.1.51694: Flags [.], ack 1, win 512, options [nop,nop,TS val 3041978560 ecr 3041963200], length 0
17:21:29.588598 lo    In  IP 127.0.0.1.51694 > 127.0.0.1.6791: Flags [.], ack 1, win 512, options [nop,nop,TS val 3041978560 ecr 3041978560], length 0
17:21:32.749659 lo    In  IP 127.0.0.1.35932 > 127.0.0.1.6789: Flags [P.], seq 3700583884:3700584202, ack 2357837195, win 512, options [nop,nop,TS val 3041981721 ecr 3041956762], length 318
17:21:32.749664 lo    In  IP 127.0.0.1.6789 > 127.0.0.1.35932: Flags [.], ack 318, win 512, options [nop,nop,TS val 3041981721 ecr 3041981721], length 0
17:21:32.749794 lo    In  IP 127.0.0.1.6789 > 127.0.0.1.35932: Flags [P.], seq 1:53, ack 318, win 512, options [nop,nop,TS val 3041981721 ecr 3041981721], length 52
17:21:32.749879 lo    In  IP 127.0.0.1.35932 > 127.0.0.1.6789: Flags [P.], seq 318:357, ack 53, win 512, options [nop,nop,TS val 3041981721 ecr 3041981721], length 39
17:21:32.790717 lo    In  IP 127.0.0.1.6789 > 127.0.0.1.35932: Flags [.], ack 357, win 512, options [nop,nop,TS val 3041981762 ecr 3041981721], length 0
17:21:32.984679 lo    In  IP 127.0.0.1.35936 > 127.0.0.1.6789: Flags [P.], seq 138184057:138184490, ack 275999808, win 512, options [nop,nop,TS val 3041981956 ecr 3041956998], length 433

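The `ss` and `tcpdump` checks above only cover TCP ports 6789 and 6791. The agent log also mentions a Unix domain socket, `unix:///var/lib/elastic-agent/.eaci.sock`, which neither command shows. A minimal Python sketch for testing whether such a socket accepts connections follows. It assumes the socket is a stream socket (not confirmed by anything in the logs), it only tests `connect()` and does not speak the agent's protocol, and `unix_socket_accepts` is a hypothetical helper name. Reading that path will normally require root.

```python
import os
import socket
import stat


def unix_socket_accepts(path, timeout=2.0):
    """Return True if `path` is a Unix socket and a stream connect() succeeds."""
    try:
        if not stat.S_ISSOCK(os.stat(path).st_mode):
            return False  # path exists but is not a socket
    except OSError:
        return False  # path missing or not readable by this user
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.settimeout(timeout)
        try:
            client.connect(path)
            return True
        except OSError:
            # Covers refused connections, timeouts and permission errors.
            return False


# Socket path taken from the agent log above.
print(unix_socket_accepts("/var/lib/elastic-agent/.eaci.sock"))
```
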

I forgot to mention that I tried reinstalling the Elastic Agent multiple times, with versions 8.17.0 and 8.17.3 (the one currently installed).
I also tried both the tar.gz install and the deb package, and I tried removing and re-adding just the integration in the policy. None of these attempts helped.

I also checked the thread End point security fails for Elastic Agent with error "Missed two check-ins" (#2 by Nick_Berlin), but localhost is already present in my /etc/hosts.

In error.txt, in the endpoint-default folder of the diagnostics file, I see this error:

diagnostic action timed out, deadline is 20s: context deadline exceeded