Logs not being harvested after agent restart

Hi,

We are using Elastic Agent with the Custom Logs integration to collect logs. This is on Elastic Cloud.

The agent servers are shut down nightly. When the servers come back up and the elastic-agent service starts, Filebeat does not harvest the logs, so no logs appear in Kibana.

The issue is intermittent, but occurs probably 75% of the time.

I can usually recreate the issue by stopping and restarting the elastic-agent service.

Filebeat will begin to harvest the logs again if I push a policy change from the Fleet server (even if the change does nothing).

The server agent status in Fleet shows as Healthy in all cases.

elastic-agent version
Binary: 7.16.2 (build: 3c518f4d17a15dc85bdd68a5a03d5af51d9edd8e at 2021-12-19 00:17:05 +0000 UTC)
Daemon: 7.16.2 (build: 3c518f4d17a15dc85bdd68a5a03d5af51d9edd8e at 2021-12-19 00:17:05 +0000 UTC)

elastic-agent status
Status: HEALTHY
Message: (no message)
Applications:
  * filebeat             (HEALTHY)
                         Running
  * metricbeat           (HEALTHY)
                         Running
  * filebeat_monitoring  (HEALTHY)
                         Running

I've reviewed the logs in Fleet and I've not been able to see an obvious cause.

Any help or suggestions would be greatly appreciated.

As you are on 7.16.2, you could run the elastic-agent diagnostics collect command. It will collect all the logs. In the Filebeat logs especially, I would expect to see some entries on why Filebeat is stuck and has stopped sending.
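For example (run as root on one of the affected hosts; the exact name of the resulting archive may differ slightly by version):

sudo elastic-agent diagnostics collect
# should produce a zip archive (something like elastic-agent-diagnostics-<timestamp>.zip)
# in the current directory, containing the agent and Beats logs plus the running configuration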

Thanks @ruflin

I reviewed the logs and spotted the following two errors in elastic-agent-json.log, so I will start looking into these.

{"log.level":"error","@timestamp":"2022-01-18T15:56:12.322Z","log.origin":{"file.name":"application/managed_mode.go","file.line":249},"message":"could not recover state, error acknowledge 1 actions '[action_id: policy:ee11dee0-751e-11ec-afe0-ff07061c2024:15:1, type: POLICY_CHANGE]' for elastic-agent '791f2e33-95f5-48e3-ab7e-a794b236b1e0' failed: status code: 0, fleet-server returned an error: , message: Unknown resource., skipping...","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-01-18T15:56:13.221Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":205},"message":"Could not communicate with fleet-server Checking API will retry, error: status code: 0, fleet-server returned an error: , message: Unknown resource.","ecs.version":"1.6.0"}

Thanks,
Jon

Where is fleet-server running? In Elastic Cloud? How many Elastic Agents do you have?

It would be good to have a look at the fleet-server logs; something about the communication with fleet-server seems to be flaky. There was a bug in 7.15 that looked very similar to this, but as you are on 7.16.2 it should not happen.

Hi Ruflin,

This is on Elastic Cloud. We have 13 agents - all showing as healthy.
For info we are now on 7.16.3.

I've noticed that for agents without the issue I see the following in the logs, but these lines are missing when the issue occurs:

[elastic_agent.filebeat][info] Applying settings for filebeat.inputs
[elastic_agent.filebeat][info] Configured paths: [/var/log/nginx/*.access.log]
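(I'm checking for those lines with a quick grep over the agent's log directory; the path below is the default install location on our CentOS hosts, so adjust if yours differs:)

grep -r "Configured paths" /opt/Elastic/Agent/data/*/logs/
# after a healthy restart this matches the nginx access log path; after a broken restart there is no match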

I had a look at the fleet logs but nothing obvious (to me) stood out.

Many thanks,

Jon

Are the machines / Elastic Agents that get stuck in any way different? Or is it random to which ones it applies?

Could you send me a PM with the cluster id so I can take a look at the logs? My suspicion is that it is not related to fleet-server, but it is still worth checking.

Hello Ruflin,

It seems to be random. The servers are all of a similar type (VMs running CentOS 7).
There are 2 agent policies in use, and servers on both are affected. I tried really simplifying the policies, but it had no effect.

I've sent you a PM.

Thanks
Jon

Thanks for the cluster id. I had a quick look at the fleet-server logs and couldn't find anything obvious in there either.

I assume the log entries you posted above are from the Filebeat logs. If this shows up in one diagnostics archive but not the other, I wonder if there is a diff in the policy. The diagnostics archive also contains the currently running policy. Could you compare the policies of two running Elastic Agents, one with the problem and one without?
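Something simple like this should be enough to spot a difference (the archive names below are just placeholders for the two diagnostics bundles you collected; the exact layout inside the zip can vary a bit between versions):

unzip -d agent-ok elastic-agent-diagnostics-ok.zip
unzip -d agent-bad elastic-agent-diagnostics-bad.zip
# diff the two bundles and focus on the yaml files that contain the running policy
diff -ru agent-ok agent-bad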

The part I don't get is why this would happen after a restart :thinking:

Hi ruflin,

I couldn't see any differences between the policies on a working server and on a broken one.
I have raised a support ticket and it is being looked at, so hopefully they can get to the bottom of it. I noticed today that there is a new release (v1.0.0) of the Custom Logs integration, which may include a fix.

Thanks for your advice,
Jon

Thanks for raising an internal ticket. We will try to get to the bottom of this. I would assume the Custom Logs integration should not make a difference here, but I might be wrong.
