Filebeat write errors

Regarding Filebeat monitoring:
What is the difference between libbeat.output and libbeat.pipeline?
Is a non-zero libbeat.output.failed a cause for concern? Are we losing data?

We are running Filebeat 7.4.1 in production, shipping monitoring metrics as per the following configuration:

    filebeat.autodiscover:
      providers:
        - type: kubernetes
          ...

    processors:
      - add_cloud_metadata: ~
      - add_docker_metadata: ~

    .....

    monitoring:
      enabled: true

Monitoring produces very detailed metrics around the Filebeat output, such as the generated event below.

"beats_stats": {
      "metrics": {
        "registrar": {
          "writes": {
            "success": 1159964,
            "total": 1159964,
            "fail": 0
          },
          "states": {
            "cleanup": 175,
            "current": 92,
            "update": 13382777
          }
        },
        "filebeat": {
          "events": {
            "added": 13382781,
            "done": 13382777,
            "active": 4
          },
          "harvester": {
            "closed": 1282,
            "running": 7,
            "open_files": 7,
            "skipped": 0,
            "started": 1289
          },
          "input": {
            "netflow": {
              "packets": {
                "received": 0,
                "dropped": 0
              },
              "flows": 0
            },
            "log": {
              "files": {
                "renamed": 0,
                "truncated": 0
              }
            }
          }
        },
        "libbeat": {
          "pipeline": {
            "clients": 37,
            "events": {
              "retry": 79623,
              "active": 4,
              "total": 13382781,
              "filtered": 2863,
              "published": 13379918,
              "failed": 0,
              "dropped": 0
            },
            "queue": {
              "acked": 13379914
            }
          },
          "config": {
            "reloads": 0,
            "module": {
              "running": 0,
              "starts": 0,
              "stops": 0
            }
          },
          "output": {
            "events": {
              "active": 21,
              "toomany": 33147,
              "batches": 1159990,
              "total": 13413082,
              "acked": 13379914,
              "failed": 33147,
              "dropped": 0,
              "duplicates": 0
            },
            "write": {
              "bytes": 39927190886,
              "errors": 0
            },
            "read": {
              "errors": 0,
              "bytes": 3115564222
            },
            "type": "elasticsearch"
          }
        },
        "system": {
          "load": {
            "1": 0.1,
            "5": 0.11,
            "15": 0.16,
            "norm": {
              "1": 0.0063,
              "5": 0.0069,
              "15": 0.01
            }
          },
          "cpu": {
            "cores": 16
          }
        },
        "beat": {
          "info": {
            "uptime": {
              "ms": 2406820129
            },
            "ephemeral_id": "dbc0e27c-9743-4b4b-9b3c-6a38478e198a"
          },
          "memstats": {
            "gc_next": 50187360,
            "rss": 170217472,
            "memory_total": 840912263792,
            "memory_alloc": 25784432
          },
          "cpu": {
            "user": {
              "time": {
                "ms": 10803749
              },
              "ticks": 10803740
            },
            "system": {
              "time": {
                "ms": 9171426
              },
              "ticks": 9171420
            },
            "total": {
              "time": {
                "ms": 19975175
              },
              "value": 19975160,
              "ticks": 19975160
            }
          },
          "runtime": {
            "goroutines": 299
          },
          "handles": {
            "open": 21,
            "limit": {
              "soft": 1048576,
              "hard": 1048576
            }
          }
        }
      }
}

Are you using Stack Monitoring?

The entries "toomany": 33147 and "failed": 33147 match.

In the context of the Elasticsearch output, toomany is the number of requests that received a 429 Too Many Requests response from Elasticsearch (a.k.a. indexing bulk rejections).

It means Elasticsearch is getting overwhelmed by the requests and tells Beats to slow down.

I would suggest that you:

  • Verify you're using the Filebeat index templates which come with Filebeat
  • Verify the cluster is sized appropriately for the amount of data being sent
    • The cluster must not perform JVM GC too frequently. To help with that, keep the number of shards/indices low.
  • Tune Filebeat to send "bigger" document bulks (it defaults to 125 documents per bulk); it is usually more efficient to send bigger bulks. A sketch follows this list.
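
A hedged sketch of what that output tuning could look like (the values are illustrative assumptions, not recommendations, and the host is a placeholder; bulk_max_size, worker, and compression_level are standard output.elasticsearch settings):

    output.elasticsearch:
      hosts: ["https://elasticsearch.example.com:9200"]  # placeholder host
      bulk_max_size: 1000      # send larger bulks than the default; size this against your cluster
      worker: 2                # parallel bulk workers per configured host
      compression_level: 1     # fewer bytes on the wire at a small CPU cost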

You can check the rejections on the cluster with: GET /_cat/thread_pool/write?v.

hi Luca,
Thanks for the response.
We are definitely getting bulk write rejections, especially since we've increased the volume of ingestion on our cluster.
We are using an appropriate Filebeat index template, with an Elasticsearch ingest pipeline to parse/process incoming data, and we have increased our bulk max size.

What I need to quantify is how much data we are losing. I see that retry is also non-zero (and large), which isn't a healthy sign either.
But I'm still wondering how much data we're losing and how we can remedy it (in the short and long term).
Short-term Filebeat tuning: increase the number of retries, increase the flush timeout. Spooling to disk?
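
Something like the following is what I have in mind, as a rough sketch only (the values are guesses to be validated, not tested settings; queue.mem and the Elasticsearch output backoff options are standard Filebeat settings):

    # Buffer bursts in a larger in-memory queue instead of flushing small batches immediately
    queue.mem:
      events: 8192             # default is 4096
      flush.min_events: 2048   # publish in larger batches
      flush.timeout: 5s        # default is 1s; wait longer before flushing a partial batch

    output.elasticsearch:
      backoff.init: 2s         # back off more aggressively after errors such as 429s
      backoff.max: 120s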

Longer term: scale the cluster's data nodes and shard count.
Introduce Logstash as an intermediary buffer.

Any short-term suggestions to stem the bleeding?

-marcus

Hello again,
I've seen no response to my question on the definitions of dropped vs failed output metrics. Could someone clarify that?

We typically see this spike in failed/dropped/retry counts when new Filebeat hosts join, or when we purposely clear the Filebeat registry (in a testing environment). Both cause a temporary surge of traffic to the Elasticsearch cluster.
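
A rough sketch of the kind of input-side throttling that might smooth these surges (illustrative only; ignore_older and harvester_limit are standard log input options, and the path and values here are placeholders; in our setup they would live inside the autodiscover provider's template config):

    filebeat.inputs:
      - type: log
        paths:
          - /var/log/containers/*.log   # placeholder path
        ignore_older: 24h               # skip very old files after a registry wipe
        harvester_limit: 50             # cap concurrent harvesters to smooth the burst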

I was heartened to see that the Filebeat docs state:
"
Filebeat ignores the max_retries setting and retries indefinitely.
"
So does this imply that the logs will eventually be published?

Thanks again!
