Possible race condition on metricbeat startup (rename meta.json.new fails)

Hey,

I use Metricbeat (version 8.13.2) as a sidecar in an AWS ECS service. Every now and then (roughly 1 out of 100 task startups, but not at regular intervals) Metricbeat fails on startup with the error message below. Since it works most of the time, I suspect some kind of race condition, but there is no additional log output and no easy way to trigger the failure, which makes it hard to really debug this issue.

{
  "log.level":"error",
  "@timestamp":"2024-04-22T13:43:24.732Z",
  "log.origin":{
    "function":"github.com/elastic/beats/v7/libbeat/cmd/instance.handleError",
    "file.name":"instance/beat.go",
    "file.line":1340
  },
  "message":"Exiting: rename /usr/share/metricbeat/data/meta.json.new /usr/share/metricbeat/data/meta.json: no such file or directory",
  "service.name":"metricbeat",
  "ecs.version":"1.6.0"
}

Could anyone provide insights or suggestions on what might be causing this issue and how to resolve it? Any help is greatly appreciated!

Looking at the code: at startup Metricbeat tries to read /usr/share/metricbeat/data/meta.json. If the file does not exist, Metricbeat creates a new one at /usr/share/metricbeat/data/meta.json.new, writes data to it, closes the file and then moves it to /usr/share/metricbeat/data/meta.json. The error comes from this very last operation: moving the file.

What is odd is that the file was created and written successfully; only the move operation is failing.
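For illustration, the sequence boils down to something like the sketch below. This is a simplified, hypothetical version, not the actual beats source; the helper name and error handling are only for demonstration.

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeMeta mimics the create -> write -> close -> rename sequence described
// above (simplified; syncing and error handling are reduced to the essentials).
func writeMeta(dataPath string, contents []byte) error {
	metaPath := filepath.Join(dataPath, "meta.json")
	tmpPath := metaPath + ".new"

	// Create meta.json.new and write the new contents to it.
	if err := os.WriteFile(tmpPath, contents, 0o600); err != nil {
		return err
	}

	// The step that fails in the report: it can only succeed if
	// meta.json.new (and the data directory) still exists at this point.
	return os.Rename(tmpPath, metaPath)
}

func main() {
	dir, err := os.MkdirTemp("", "beat-data")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	if err := writeMeta(dir, []byte(`{"uuid":"example"}`)); err != nil {
		fmt.Println("Exiting:", err)
	}
}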

I managed to reproduce it by pausing Metricbeat's execution right before it moves the file, removing meta.json.new, and then letting Metricbeat continue.
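To see the same failure in isolation, here is a small, hypothetical stand-alone reproduction of that scenario: deleting the .new file between the write and the rename yields exactly the same error string.

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	dir, err := os.MkdirTemp("", "beat-data")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	tmp := filepath.Join(dir, "meta.json.new")
	dst := filepath.Join(dir, "meta.json")

	// Write meta.json.new as Metricbeat would...
	if err := os.WriteFile(tmp, []byte(`{"uuid":"example"}`), 0o600); err != nil {
		panic(err)
	}

	// ...then simulate whatever removes the file before the rename runs.
	if err := os.Remove(tmp); err != nil {
		panic(err)
	}

	// Prints: rename .../meta.json.new .../meta.json: no such file or directory
	if err := os.Rename(tmp, dst); err != nil {
		fmt.Println(err)
	}
}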

It really looks like something is happening on the filesystem after /usr/share/metricbeat/data/meta.json.new is closed.
Is /usr/share/metricbeat/data/, or any folder in that path, a volume/mount?

How are you running Metricbeat? /usr/share/metricbeat is the default path.home when running Metricbeat in Docker (which puts the data path at /usr/share/metricbeat/data); are you using a Docker container?

Hey,

thanks for taking a look at the issue! I found out a few more things and can provide you with some more context.

As mentioned, Metricbeat is running as an ECS service on AWS. There is no volume mounted. Here is the Containerfile for more context:

# Currently ELK_VERSION=8.13.2
ARG ELK_VERSION
FROM elastic/metricbeat:${ELK_VERSION}

# Disable the default system module config
RUN mv /usr/share/metricbeat/modules.d/system.yml /usr/share/metricbeat/modules.d/system.yml.disabled

# Copy the configuration file into the container
COPY metricbeat.yml /usr/share/metricbeat/metricbeat.yml
USER root
RUN chown root:metricbeat /usr/share/metricbeat/metricbeat.yml && \
    chmod go-w /usr/share/metricbeat/metricbeat.yml

# Switch back to the metricbeat user
USER metricbeat

CMD ["metricbeat", "-c" , "/usr/share/metricbeat/metricbeat.yml"]

And here are some more logs from when the crash happened:

[WARN  tini (7)] Tini is not running as PID 1 and isn't registered as a child subreaper.
Zombie processes will not be re-parented to Tini, so zombie reaping won't work.
To fix the problem, use the -s option or set the environment variable TINI_SUBREAPER to register Tini as a child subreaper, or run Tini as PID 1.
{"log.level":"info","@timestamp":"2024-04-30T13:37:34.672Z","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/cmd/instance.(*Beat).configure","file.name":"instance/beat.go","file.line":811},"message":"Home path: [/usr/share/metricbeat] Config path: [/usr/share/metricbeat] Data path: [/usr/share/metricbeat/data] Logs path: [/usr/share/metricbeat/logs]","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-04-30T13:37:34.698Z","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/cmd/instance.handleError","file.name":"instance/beat.go","file.line":1340},"message":"Exiting: rename /usr/share/metricbeat/data/meta.json.new /usr/share/metricbeat/data/meta.json: no such file or directory","service.name":"metricbeat","ecs.version":"1.6.0"}
Exiting: rename /usr/share/metricbeat/data/meta.json.new /usr/share/metricbeat/data/meta.json: no such file or directory

Maybe the failed rename is just a side effect of the container already dying. What I still don't understand is why Tini complains and why this occurs only sporadically.

Off the top of my head I can't see why this happens sporadically either; however, the first thing to try is to keep the ENTRYPOINT and CMD behaving as in the original Docker image: beats/dev-tools/packaging/templates/docker/Dockerfile.tmpl at c1748f7965c9bfd4d488403bb2a734f6f5627219 · elastic/beats · GitHub

For that I suggest you edit your CMD line to:

CMD ["-environment", "container", "-c" , "/usr/share/metricbeat/metricbeat.yml"]