Filebeat stops sending logs to Elasticsearch after a cluster_block_exception occurs

I'm containerizing a monolithic application. In each of the current VMs there is a Logstash process that harvests the logs, which are rotated by the application itself through Python's RotatingFileHandler.

In the current stack, Logstash sends the logs to an Elasticsearch index. It sometimes happened that the cluster ran out of disk space, but since Logstash retries indefinitely, no logs were lost once we made room for new data.

Now we would like to try out Filebeat instead of Logstash, so we decided to try the following approach:

  • a persistent external filesystem
  • a container instance for our application, which mounts the filesystem to write its rotated logs, using a UUID in the file name (something like 202011_uuid.log.*) so that scaling causes no conflicts
  • a single Filebeat container for the whole cluster, which mounts the filesystem, harvests the logs, and sends them directly to Elasticsearch (see the sketch after this list).
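
For context, the Filebeat side of that single container looks roughly like the sketch below; the /logs mount point, the host, and the path pattern are placeholders rather than our real values:

filebeat.inputs:
  - type: log
    # the shared persistent filesystem is mounted here (placeholder path)
    paths:
      - /logs/*_*.log*

output.elasticsearch:
  hosts: ["http://elasticsearch:9200"]   # placeholder host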

This approach works well until Elasticsearch runs out of space and raises a cluster_block_exception. In this scenario, with the default config, Filebeat should retry indefinitely (https://www.elastic.co/guide/en/beats/filebeat/7.10/elasticsearch-output.html#_max_retries), but it actually makes only 3 attempts, over roughly 2 minutes, and then drops the events completely.
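
For reference (assuming I'm reading the linked page correctly), these are the defaults I was relying on:

output.elasticsearch:
  backoff.init: 1s   # first wait after a failed publish attempt, doubled on each consecutive failure
  backoff.max: 60s   # upper bound on the wait between retries
  max_retries: 3     # per the note in the docs, Filebeat should ignore this and retry indefinitely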

Doing some research, I stumbled upon some users who set max_retries to -1, but since I cannot be sure I'll be able to free up disk space quickly in production, and I don't want Filebeat to flood my index, I ended up configuring Filebeat this way:

output.elasticsearch:
  backoff.init: 20s
  backoff.max: 72h
  max_retries: -1

Now, with the configuration above, it seems like Filebeat ignores the retry settings and makes only a single attempt.

What am I doing wrong? Is it a flaw in my stack or a configuration issue?
