Elastic Agent defunct on fleet server and clients

Hello there,

I have installed elastic-agent on an Ubuntu Server box, and when I list the running processes I get this:

And then I realized that I got the same on the fleet server:

This doesn't look normal.

Elastic Agent version on both servers:

Binary: 8.2.1 (build: 40ea6cb697bcb76375527092a19d7413bfa00f3f at 2022-05-19 19:19:07 +0000 UTC)
Daemon: 8.2.1 (build: 40ea6cb697bcb76375527092a19d7413bfa00f3f at 2022-05-19 19:19:07 +0000 UTC)

Thanks for the attention.

Was this directly after an install? The installation command starts another instance of elastic-agent that will run and gather data.

Yes! Actually, I've done another install on a new box and the result was the same.

Just checked again:

Fleet server:

Client server:

It's expected behaviour due to the install process. If you take a look in Kibana you should see all your agents there.

Indeed. The problem is that there are no logs coming through.

I'm integrating MISP using elastic-agent, but no luck...

You are able to see the agents, but no logs?
Is there a proxy between the agents and Elasticsearch?

Nope. No proxy. Same network

This is what I can see:

And this is the log page for the selected agent:

Can you provide a diagnostics bundle from an affected agent, via elastic-agent diagnostics collect?

Of course. The only problem I can see is that I cannot upload zip files, only images.

@francescouk please don't post pictures of text, logs or code. They are difficult to read, impossible to search and replicate (if it's code), and some people may not even be able to see them 🙂

You will need to store the diagnostics elsewhere and link to them here.

That's perfect. I'll do it right now.

Hi there,

Here is the shared link with the elastic-agent diagnostics.

Thanks

Those are agent logs. Do you even have them enabled in the agent configuration? You can also click on the dataset and see the logs from filebeat etc., to check whether it's collecting data or whether there are any errors. To see the MISP logs themselves, go to the Security - Events section or to Discover.

Hi there,

No logs at all, only from the Windows servers. I've run a tcpdump to see if there's communication between the Linux server and the fleet server, and I can see that there is.

But if I go directly to the dataset for this (Linux) server, it shows no logs.

I found the filebeat dataset:

08:47:03.588  elastic_agent.filebeat  [elastic_agent.filebeat][info] Non-zero metrics in the last 30s
08:47:33.588  elastic_agent.filebeat  [elastic_agent.filebeat][info] Non-zero metrics in the last 30s
08:47:50.159  elastic_agent.filebeat  [elastic_agent.filebeat][info] File is inactive. Closing because close_inactive of 5m0s reached.
08:48:03.588  elastic_agent.filebeat  [elastic_agent.filebeat][info] Non-zero metrics in the last 30s
08:48:33.588  elastic_agent.filebeat  [elastic_agent.filebeat][info] Non-zero metrics in the last 30s
08:48:53.651  elastic_agent.filebeat  [elastic_agent.filebeat][error] request failed
08:49:03.588  elastic_agent.filebeat  [elastic_agent.filebeat][info] Non-zero metrics in the last 30s
08:49:33.588  elastic_agent.filebeat  [elastic_agent.filebeat][info] Non-zero metrics in the last 30s
08:50:03.588  elastic_agent.filebeat  [elastic_agent.filebeat][info] Non-zero metrics in the last 30s

Also from metricbeat:

08:51:34.112  elastic_agent.metricbeat  [elastic_agent.metricbeat][info] Non-zero metrics in the last 30s
08:52:04.111  elastic_agent.metricbeat  [elastic_agent.metricbeat][info] Non-zero metrics in the last 30s
08:52:34.111  elastic_agent.metricbeat  [elastic_agent.metricbeat][info] Non-zero metrics in the last 30s
08:53:04.111  elastic_agent.metricbeat  [elastic_agent.metricbeat][info] Non-zero metrics in the last 30s
08:53:34.111  elastic_agent.metricbeat  [elastic_agent.metricbeat][info] Non-zero metrics in the last 30s
08:54:04.112  elastic_agent.metricbeat  [elastic_agent.metricbeat][info] Non-zero metrics in the last 30s
08:54:04.112  elastic_agent.metricbeat  [elastic_agent.metricbeat][info] Non-zero metrics in the last 30s

That's all.

Very strange,

The logs in the diagnostics you provided show that the agent is unable to check in with Fleet. It's either timing out or getting 400 responses:

{"log.level":"error","@timestamp":"2022-05-26T15:38:22.673Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":206},"message":"Could not communicate with fleet-server Checking API will retry, error: status code: 400, fleet-server returned an error: BadRequest","ecs.version":"1.6.0"}
...
{"log.level":"error","@timestamp":"2022-05-27T00:28:55.986Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":206},"message":"Could not communicate with fleet-server Checking API will retry, error: fail to checkin to fleet-server: Post \"https://XX.XX.XX.XX:8220/api/fleet/agents/0abf252d-1e5e-4312-a24c-ee9fdacdedd8/checkin?\": EOF","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-05-27T00:31:09.504Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":206},"message":"Could not communicate with fleet-server Checking API will retry, error: fail to checkin to fleet-server: Post \"https://XX.XX.XX.XX:8220/api/fleet/agents/0abf252d-1e5e-4312-a24c-ee9fdacdedd8/checkin?\": dial tcp XX.XX.XX.XX:8220: connect: no route to host","ecs.version":"1.6.0"}

I'm not sure what's causing this; the reason should show up in the fleet-server logs, if you want to check.
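If it helps to see which of these failures dominates, the check-in errors can be tallied straight from the agent's NDJSON log lines. This is only a rough sketch: the function names and category labels are made up for illustration, and the substring matching is based solely on the messages quoted in this thread, so anything else lands in "other".

```python
import json
from collections import Counter


def classify_checkin_error(message):
    """Map a fleet_gateway error message to a coarse, illustrative category."""
    if "status code: 400" in message:
        return "http_400_bad_request"
    if "no route to host" in message:
        return "no_route_to_host"
    if message.rstrip().endswith(": EOF"):
        return "connection_closed_eof"
    return "other"


def summarize_checkin_errors(lines):
    """Count error-level entries logged from fleet/fleet_gateway.go, by category.

    `lines` is any iterable of NDJSON log lines; lines that are not valid
    JSON objects (such as a literal "...") are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict):
            continue
        if entry.get("log.level") != "error":
            continue
        origin = entry.get("log.origin") or {}
        if origin.get("file.name") != "fleet/fleet_gateway.go":
            continue
        counts[classify_checkin_error(entry.get("message", ""))] += 1
    return counts
```

Passing it the lines of an agent log file from the unpacked diagnostics bundle returns a Counter keyed by category, which makes it easier to tell a steady stream of 400s from occasional network drops.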

Your filebeat logs are also indicating timeout issues:

{"log.level":"error","@timestamp":"2022-05-30T09:48:51.525-0300","log.logger":"input.httpjson-cursor.retryablehttp","log.origin":{"file.name":"go-retryablehttp@v0.6.6/client.go","file.line":553},"message":"request failed","service.name":"filebeat","id":"httpjson-ti_misp.threat-020bdbd9-5a36-43b3-95c6-85b5b7d3392f","input_source":"https://XX.XX.XX.XX/events/restSearch","input_url":"https://XX.XX.XX.XX/events/restSearch","error":{"message":"Post \"https://XX.XX.XX.XX/events/restSearch\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)"},"method":"POST","url":"https://XX.XX.XX.XX/events/restSearch","ecs.version":"1.6.0"}
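"Client.Timeout exceeded while awaiting headers" means the HTTP client gave up before MISP sent back any response headers. One way to narrow that down is to time the same kind of POST from the agent host, outside of filebeat. A minimal sketch, with several assumptions: the function name, the 30-second default and the request body are my own choices, not the integration's actual settings. It also ignores proxy environment variables, since you said there is no proxy in between.

```python
import json
import ssl
import time
import urllib.error
import urllib.request


def probe_post(url, body, headers, timeout_s=30.0, verify_tls=True):
    """POST `body` as JSON to `url` and report whether the server answered in time.

    Returns (answered, elapsed_seconds, detail). `answered` is True as soon as
    response headers arrive, even if the status code is an error.
    """
    data = json.dumps(body).encode("utf-8")
    request = urllib.request.Request(url, data=data, headers=headers, method="POST")

    # Ignore any http_proxy/https_proxy environment variables.
    handlers = [urllib.request.ProxyHandler({})]
    if url.startswith("https://"):
        context = ssl.create_default_context()
        if not verify_tls:
            # Only for a quick test against a self-signed certificate.
            context.check_hostname = False
            context.verify_mode = ssl.CERT_NONE
        handlers.append(urllib.request.HTTPSHandler(context=context))
    opener = urllib.request.build_opener(*handlers)

    started = time.monotonic()
    try:
        with opener.open(request, timeout=timeout_s) as response:
            answered, detail = True, f"HTTP {response.status}"
    except urllib.error.HTTPError as err:
        # Headers arrived, but with an error status (e.g. 403 for a bad key).
        answered, detail = True, f"HTTP {err.code}"
    except OSError as err:
        # Covers URLError, connection failures and timeouts.
        answered, detail = False, f"{type(err).__name__}: {err}"
    return answered, time.monotonic() - started, detail
```

You would call it with your own MISP URL ending in /events/restSearch, a small body such as {"returnFormat": "json", "limit": 1}, and the headers your instance expects (as far as I know MISP takes the API key in the Authorization header, but please check that against your setup). If a tiny query like that answers quickly while the integration still times out, the size of the result set would be the next thing I'd look at.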

You've also mentioned that you have deployed to Windows machines; however, your config has a Unix path for the CA cert. We don't support mixing Unix and Windows file paths in the config, so you should inline the CA instead.

No, what I said was that I was receiving logs from the Windows server boxes but not from the Linux machines, so there's no mixing in the config.

About the fleet server, what exactly do I need to check? Which logs?

So far, the logs I can see on the fleet server that don't look normal are:

{"log.level":"error","@timestamp":"2022-05-27T09:52:14.579-0300","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":206},"message":"Could not communicate with fleet-server Checking API will retry, error: status code: 400, fleet-server returned an error: BadRequest","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-05-27T12:40:51.739-0300","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":206},"message":"Could not communicate with fleet-server Checking API will retry, error: status code: 400, fleet-server returned an error: BadRequest","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-05-28T18:10:37.712-0300","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":206},"message":"Could not communicate with fleet-server Checking API will retry, error: status code: 400, fleet-server returned an error: BadRequest","ecs.version":"1.6.0"}

And this one:

{"log.level":"error","@timestamp":"2022-05-24T23:24:19.440-0300","log.origin":{"file.name":"process/app.go","file.line":290},"message":"failed to stop fleet-server: os: process already finished","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-05-26T20:53:26.475-0300","log.origin":{"file.name":"log/reporter.go","file.line":36},"message":"2022-05-26T20:53:26-03:00 - message: Application: filebeat--8.2.1--36643631373035623733363936343635[cc0e4a76-f1fb-4b7c-8b6e-5ac8857355c6]: State changed to FAILED: failed to stop after 30s: application stopping timed out - type: 'ERROR' - sub_type: 'FAILED'","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-05-26T20:53:26.475-0300","log.origin":{"file.name":"process/app.go","file.line":158},"message":"failed to stop after 30s: application stopping timed out","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-05-26T20:53:26.475-0300","log.origin":{"file.name":"process/app.go","file.line":290},"message":"failed to stop fleet-server: os: process already finished","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-05-26T21:29:28.652-0300","log.origin":{"file.name":"process/app.go","file.line":158},"message":"failed to stop after 30s: application stopping timed out","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-05-26T21:29:28.652-0300","log.origin":{"file.name":"process/app.go","file.line":290},"message":"failed to stop fleet-server: os: process already finished","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-05-26T21:29:28.652-0300","log.origin":{"file.name":"log/reporter.go","file.line":36},"message":"2022-05-26T21:29:28-03:00 - message: Application: fleet-server--8.2.1[cc0e4a76-f1fb-4b7c-8b6e-5ac8857355c6]: State changed to FAILED: failed to stop after 30s: application stopping timed out - type: 'ERROR' - sub_type: 'FAILED'","ecs.version":"1.6.0"}
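One thing that makes these harder to compare with the earlier ones: the agent logs quoted from the diagnostics use UTC timestamps (Z), while these use a -0300 offset. A small sketch for putting both on one UTC timeline. The function names are illustrative, and it assumes the timestamp format seen in this thread (fractional seconds, then either Z or a numeric offset) and Python 3.7 or later.

```python
import json
from datetime import datetime, timezone


def to_utc(timestamp):
    """Parse an @timestamp like 2022-05-27T09:52:14.579-0300 (or ...Z) into UTC."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%f%z")
    return parsed.astimezone(timezone.utc)


def merge_timeline(sources):
    """Merge NDJSON log lines from several hosts into one list sorted by UTC time.

    `sources` maps a label (e.g. "agent", "fleet") to an iterable of log lines.
    Lines that are not JSON objects or lack a parsable @timestamp are skipped.
    Returns a list of (utc_datetime, label, message) tuples.
    """
    timeline = []
    for label, lines in sources.items():
        for line in lines:
            try:
                entry = json.loads(line)
                when = to_utc(entry["@timestamp"])
            except (ValueError, KeyError, TypeError):
                continue
            timeline.append((when, label, entry.get("message", "")))
    timeline.sort(key=lambda item: item[0])
    return timeline
```

For instance, 2022-05-26T21:29:28.652-0300 is 2022-05-27T00:29:28.652Z, roughly half a minute after the agent's EOF check-in error at 2022-05-27T00:28:55.986Z. That may be a coincidence, but it is the kind of overlap worth looking for.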

I have installed a new Ubuntu Linux server with MISP and still get the same error. My question is: does this MISP integration with Elasticsearch really work?

As soon as I connect to this box, the first message that appears is:

 => There are 2 zombie processes.

And then I checked the status:

administrator@misp:~$ ps axo stat,ppid,pid,comm | grep -w defunct
Zs   12479 12511 elastic-agent <defunct>
Zs   12479 12702 elastic-agent <defunct>

I really don't know if this is normal behaviour...
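For context, a defunct (zombie) entry is a child process that has already exited but whose parent has not yet collected its exit status, so the PPID column is the interesting one. Both entries above share the parent 12479. A small sketch for grouping zombies by parent from the same ps output; the function name is illustrative, and it assumes the stat,ppid,pid,comm column order used above.

```python
from collections import defaultdict


def zombies_by_parent(ps_output):
    """Group zombie rows from `ps axo stat,ppid,pid,comm` output by parent PID.

    Returns {ppid: [(pid, comm), ...]}. Header rows and non-zombie rows are
    skipped; a row counts as a zombie when its STAT field starts with "Z".
    """
    grouped = defaultdict(list)
    for line in ps_output.splitlines():
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        stat, ppid, pid, comm = fields
        if not stat.startswith("Z"):
            continue
        if not (ppid.isdigit() and pid.isdigit()):
            continue
        grouped[int(ppid)].append((int(pid), comm.strip()))
    return dict(grouped)
```

Running something like ps -o pid,comm,args -p 12479 afterwards would show what that parent process is, which is the process that would need to reap them.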