Filebeat write errors

Regarding Filebeat monitoring:
What is the difference between libbeat.output and libbeat.pipeline?
Is a non-zero libbeat.output.failed a cause for concern? Are we losing data?

We are running Filebeat 7.4.1 in production, shipping monitoring metrics as per the following configuration:

    filebeat.autodiscover:
      providers:
        - type: kubernetes
          ...

    processors:
      - add_cloud_metadata: ~
      - add_docker_metadata: ~

    .....

    monitoring:
      enabled: true

Monitoring produces very detailed metrics around the Filebeat output, such as the generated event below.

"beats_stats": {
      "metrics": {
        "registrar": {
          "writes": {
            "success": 1159964,
            "total": 1159964,
            "fail": 0
          },
          "states": {
            "cleanup": 175,
            "current": 92,
            "update": 13382777
          }
        },
        "filebeat": {
          "events": {
            "added": 13382781,
            "done": 13382777,
            "active": 4
          },
          "harvester": {
            "closed": 1282,
            "running": 7,
            "open_files": 7,
            "skipped": 0,
            "started": 1289
          },
          "input": {
            "netflow": {
              "packets": {
                "received": 0,
                "dropped": 0
              },
              "flows": 0
            },
            "log": {
              "files": {
                "renamed": 0,
                "truncated": 0
              }
            }
          }
        },
        "libbeat": {
          "pipeline": {
            "clients": 37,
            "events": {
              "retry": 79623,
              "active": 4,
              "total": 13382781,
              "filtered": 2863,
              "published": 13379918,
              "failed": 0,
              "dropped": 0
            },
            "queue": {
              "acked": 13379914
            }
          },
          "config": {
            "reloads": 0,
            "module": {
              "running": 0,
              "starts": 0,
              "stops": 0
            }
          },
          "output": {
            "events": {
              "active": 21,
              "toomany": 33147,
              "batches": 1159990,
              "total": 13413082,
              "acked": 13379914,
              "failed": 33147,
              "dropped": 0,
              "duplicates": 0
            },
            "write": {
              "bytes": 39927190886,
              "errors": 0
            },
            "read": {
              "errors": 0,
              "bytes": 3115564222
            },
            "type": "elasticsearch"
          }
        },
        "system": {
          "load": {
            "1": 0.1,
            "5": 0.11,
            "15": 0.16,
            "norm": {
              "1": 0.0063,
              "5": 0.0069,
              "15": 0.01
            }
          },
          "cpu": {
            "cores": 16
          }
        },
        "beat": {
          "info": {
            "uptime": {
              "ms": 2406820129
            },
            "ephemeral_id": "dbc0e27c-9743-4b4b-9b3c-6a38478e198a"
          },
          "memstats": {
            "gc_next": 50187360,
            "rss": 170217472,
            "memory_total": 840912263792,
            "memory_alloc": 25784432
          },
          "cpu": {
            "user": {
              "time": {
                "ms": 10803749
              },
              "ticks": 10803740
            },
            "system": {
              "time": {
                "ms": 9171426
              },
              "ticks": 9171420
            },
            "total": {
              "time": {
                "ms": 19975175
              },
              "value": 19975160,
              "ticks": 19975160
            }
          },
          "runtime": {
            "goroutines": 299
          },
          "handles": {
            "open": 21,
            "limit": {
              "soft": 1048576,
              "hard": 1048576
            }
          }
        }
      }
}

Are you using Stack Monitoring?

The entries "toomany": 33147 and "failed": 33147 match.

In the context of the Elasticsearch output, toomany is the number of requests that received a 429 Too Many Requests response from Elasticsearch (a.k.a. indexing bulk rejections).

It means Elasticsearch is getting overwhelmed by the requests and tells Beats to slow down.

I would suggest that you:

  • Verify you're using the Filebeat index templates which come with Filebeat
  • Verify the cluster is sized appropriately for the amount of data being sent
    • The cluster must not perform JVM GC too frequently. To help with that, keep the number of shards/indices low.
  • Tune Filebeat to send "bigger" document bulks (it defaults to 125 documents per bulk); it is usually more efficient to send bigger bulks. A sketch follows this list.
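
A hedged sketch of what that output tuning could look like (the values are illustrative assumptions, not recommendations, and the host is a placeholder; bulk_max_size, worker, and compression_level are standard output.elasticsearch settings):

    output.elasticsearch:
      hosts: ["https://elasticsearch.example.com:9200"]  # placeholder host
      bulk_max_size: 1000      # send larger bulks than the default; size this against your cluster
      worker: 2                # parallel bulk workers per configured host
      compression_level: 1     # fewer bytes on the wire at a small CPU cost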

You can check the rejections on the cluster with: GET /_cat/thread_pool/write?v.

hi Luca,
Thanks for the response.
We are definitely getting bulk write rejections, especially since we've increased the volume of ingestion on our cluster.
We are using an appropriate Filebeat index template, with an Elasticsearch ingest pipeline to parse/process incoming data, and we have increased our bulk max size.

What I need to quantify is how much data we are losing. I see that retry is also non-zero (and large), which isn't a healthy sign either.
But I'm still wondering how much data we're losing and how we can remedy it (in the short and long term).
Short-term Filebeat tuning: increase the number of retries, increase the flush timeout. Spooling to disk?
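
Something like the following is what I have in mind, as a rough sketch only (the values are guesses to be validated, not tested settings; queue.mem and the Elasticsearch output backoff options are standard Filebeat settings):

    # Buffer bursts in a larger in-memory queue instead of flushing small batches immediately
    queue.mem:
      events: 8192             # default is 4096
      flush.min_events: 2048   # publish in larger batches
      flush.timeout: 5s        # default is 1s; wait longer before flushing a partial batch

    output.elasticsearch:
      backoff.init: 2s         # back off more aggressively after errors such as 429s
      backoff.max: 120s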

Longer term: scale the cluster's data nodes and shard count.
Introduce Logstash as an intermediary buffer.

Any short-term suggestions to stem the bleeding?

-marcus

Hello again,
I've seen no response to my question on the definitions of dropped vs failed output metrics. Could someone clarify that?

We typically see this spike in failed/dropped/retry counts when new Filebeat hosts join, or when we purposely clear the Filebeat registry (in a testing environment). Both cause a temporary surge of traffic to the Elasticsearch cluster.
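
A rough sketch of the kind of input-side throttling that might smooth these surges (illustrative only; ignore_older and harvester_limit are standard log input options, and the path and values here are placeholders; in our setup they would live inside the autodiscover provider's template config):

    filebeat.inputs:
      - type: log
        paths:
          - /var/log/containers/*.log   # placeholder path
        ignore_older: 24h               # skip very old files after a registry wipe
        harvester_limit: 50             # cap concurrent harvesters to smooth the burst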

I was heartened to see that the Filebeat docs state:
"
Filebeat ignores the max_retries setting and retries indefinitely.
"
So does this imply that the logs will eventually be published?

Thanks again!
