Filebeat CrashLoopBackOff

After installing Filebeat from the new 8.5 Helm chart in our Kubernetes cluster, 5 of the pods are stuck in CrashLoopBackOff:

{"log.level":"error","@timestamp":"2022-12-08T21:13:19.258Z","log.origin":{"file.name":"instance/beat.go","file.line":1057},"message":"Exiting: cannot obtain lockfile: connot start, data directory belongs to process with pid 10","service.name":"filebeat","ecs.version":"1.6.0"}
Exiting: cannot obtain lockfile: connot start, data directory belongs to process with pid 10

Any idea what can cause this issue?

When I ran the same Helm charts locally with minikube, everything was fine.

Hi!
Same problem here as well.
We are running K8s version 1.23.
We have installed filebeat via DaemonSet as documented here: Run Filebeat on Kubernetes | Filebeat Reference [8.5] | Elastic

@Martin_Schimandl Did you have a previous filebeat instance installed on the cluster?
And are you sending it directly to elasticsearch or via logstash?

When I use the 7.5 Helm chart to send to Logstash I have no problem; it only happens when I send directly to Elasticsearch.
Also interesting: if I uninstall Filebeat (Helm), wait 2 days and re-install it, it works OK, but this is not fully tested.

I had a similar problem. Deleting the filebeat.lock file fixed it for me.

On the node: /var/lib/filebeat-{..}/filebeat.lock
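
Roughly, assuming the chart mounts Filebeat's data directory from the host under /var/lib (the path, namespace and pod name below are placeholders, adjust them to your own setup):

# On the node that hosts the crashing pod (path is an assumption):
sudo rm -f /var/lib/filebeat-*/filebeat.lock
# Then delete the filebeat pod on that node so the DaemonSet recreates it:
kubectl delete pod <filebeat-pod-on-that-node> -n logging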


We previously used fluentd and sent the data directly to elasticsearch.
We switched to filebeat in order to make it easier to send the data to logstash, like we do for our non-Kubernetes infrastructure.

I am quite sure this issue is not related to the destination of the logs.
My guess is that this is a race condition with the kubelet, some kernel API, or something like that, since the process ID in the error message is always pretty low.

My current workaround to reduce the chance of hitting this issue was to switch from the container image docker.elastic.co/beats/filebeat:8.5.3 to docker.elastic.co/beats/filebeat:8.4.3, because it looks like this made the error appear less often (see the command below).
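
If you are installing from the elastic/filebeat chart, something like this should pin the older image; imageTag is what that chart's values call it as far as I can tell, and the release name and namespace are just placeholders for your own:

helm upgrade --install filebeat elastic/filebeat \
  --namespace logging \
  --set imageTag=8.4.3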

@aaszxc good idea!
Will try that when we get the crash loop again

This does not work. The file appears again after a few minutes and the pod crashes again.

Try analyzing resource consumption, or increase the resources assigned to the Filebeat pods.
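
For example (the namespace and label selector are assumptions, and the limits are placeholders, not recommendations):

# Check what the filebeat pods are actually using (requires metrics-server):
kubectl top pods -n logging -l app=filebeat
# Then raise the limits via the chart's resources value:
helm upgrade filebeat elastic/filebeat --namespace logging \
  --set resources.limits.memory=400Mi \
  --set resources.limits.cpu=1000m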

If you're running in a container, a solution I've found is running the following command in the container:

find /usr/share/elastic-agent/. -type f -name "*beat*lock" -exec rm {} \;

This will clean up all of the lock files related to Beats running under the Elastic Agent. You will need to wait a few minutes for the agent to return to a healthy state. I've noticed that running the command the first time doesn't always work, so if you give it a few minutes and it's still having issues, you can rerun the command and it should hopefully get there eventually.
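
If you are running Filebeat as its own DaemonSet rather than under Elastic Agent, the equivalent lock file lives in Filebeat's data path (by default /usr/share/filebeat/data in the official image), so the same idea would look roughly like this (namespace and label selector are assumptions):

# Remove the stale lock file in every filebeat pod; if a container is not
# running because of the crash loop, remove the file on the node instead,
# since the data directory is usually a hostPath mount:
for pod in $(kubectl get pods -n logging -l app=filebeat -o name); do
  kubectl exec -n logging "$pod" -- rm -f /usr/share/filebeat/data/filebeat.lock
done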

This appears to be a bug which is being worked on (Refactor beats lockfile to use timeout, retry by fearful-symmetry · Pull Request #34194 · elastic/beats · GitHub)

