Filebeat http stats: number of events in queue


(sergey arlashin) #1

Hi!

Which parameter in the Filebeat stats exposed via HTTP (http://localhost:5066/stats?pretty) shows the number of events currently in the spool queue?

We are sending logs from Filebeat to Logstash and want to monitor Filebeat's queue size.

Thanks!

Regards,
Sergey


(Steffen Siering) #2

'spool queue'? Do you use spooling to disk or the memory queue?


(sergey arlashin) #3

I'm using spooling to disk. Seems I named it incorrectly, sorry.

Currently the stats output is the following:

{
  "beat": {
    "cpu": {
      "system": {
        "ticks": 126531,
        "time": {
          "ms": 126531
        }
      },
      "total": {
        "ticks": 4275999,
        "time": {
          "ms": 4275999
        },
        "value": 4275999
      },
      "user": {
        "ticks": 4149468,
        "time": {
          "ms": 4149468
        }
      }
    },
    "handles": {
      "open": 194
    },
    "info": {
      "ephemeral_id": "82f7dec7-a787-417a-987d-35db893277ab",
      "uptime": {
        "ms": 70301782
      }
    },
    "memstats": {
      "gc_next": 36115744,
      "memory_alloc": 40274408,
      "memory_total": 468534822008,
      "rss": 1439768576
    }
  },
  "filebeat": {
    "events": {
      "active": 315,
      "added": 49756811,
      "done": 49756496
    },
    "harvester": {
      "closed": 86,
      "open_files": 1,
      "running": 1,
      "skipped": 0,
      "started": 87
    },
    "input": {
      "log": {
        "files": {
          "renamed": 0,
          "truncated": 0
        }
      }
    }
  },
  "libbeat": {
    "config": {
      "module": {
        "running": 0,
        "starts": 0,
        "stops": 0
      },
      "reloads": 0
    },
    "output": {
      "events": {
        "acked": 49753904,
        "active": 2332,
        "batches": 20379,
        "dropped": 0,
        "duplicates": 0,
        "failed": 4733,
        "total": 49760969
      },
      "read": {
        "bytes": 139128,
        "errors": 0
      },
      "type": "logstash",
      "write": {
        "bytes": 2905357460,
        "errors": 2
      }
    },
    "pipeline": {
      "clients": 1,
      "events": {
        "active": 315,
        "dropped": 0,
        "failed": 0,
        "filtered": 260,
        "published": 49756551,
        "retry": 1069725,
        "total": 49756811
      },
      "queue": {
        "acked": 49756236
      }
    }
  },
  "registrar": {
    "states": {
      "cleanup": 86,
      "current": 1,
      "update": 49756496
    },
    "writes": {
      "fail": 0,
      "success": 23349,
      "total": 23349
    }
  },
  "system": {
    "cpu": {
      "cores": 2
    }
  }
}

However, when I stop Logstash, I don't see any value increase except for failed. But there should definitely be something that represents the number of messages that are saved locally to disk but not yet forwarded to Logstash.
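For anyone repeating this experiment, here is a minimal Python sketch that compares two snapshots of the stats output and lists which numeric values changed. The helper names (flatten, changed_counters, fetch_stats) are made up for illustration; only the URL and the JSON layout come from the output above.

```python
import json
from urllib.request import urlopen


def flatten(stats, prefix=""):
    """Turn nested stats into dotted paths, e.g. 'libbeat.output.events.failed'."""
    flat = {}
    for key, value in stats.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def changed_counters(before, after):
    """Return {path: delta} for numeric values that differ between two snapshots."""
    old, new = flatten(before), flatten(after)
    deltas = {}
    for path, value in new.items():
        previous = old.get(path)
        # Skip booleans and non-numeric values such as the output type string.
        if isinstance(value, bool) or isinstance(previous, bool):
            continue
        if isinstance(value, (int, float)) and isinstance(previous, (int, float)):
            if value != previous:
                deltas[path] = value - previous
    return deltas


def fetch_stats(url="http://localhost:5066/stats"):
    """Fetch one snapshot from the HTTP stats endpoint (needs a running Filebeat)."""
    with urlopen(url, timeout=5) as response:
        return json.load(response)
```

Take one snapshot with fetch_stats(), stop Logstash, wait a bit, take a second snapshot, and pass both to changed_counters().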


(Steffen Siering) #4

Unfortunately we don't expose metrics for spooling to disk yet. For the spool we don't only want to collect the number of events, but also add more metrics, like disk usage, number of I/O operations and so on.

Spooling to disk is still in beta. You can follow this meta ticket for progress in beats and this one on go-txfile.


(sergey arlashin) #5

Thank you for the answer. Then is there perhaps some other way to detect that, for example, Logstash is not fast enough to consume the log flow from Filebeat (while using spooling to disk)?

Also, if I switch to the memory queue, what happens if I'm tailing fast-rotating files (rotated once every 20 minutes) and Logstash is unresponsive for a much longer period of time (say a couple of hours)? How should I configure Filebeat in this case?

I'm mostly concerned about Windows hosts.


(Steffen Siering) #6

Relatively fast rotating files with potentially long downtimes is the use case we introduced spooling to disk into Filebeat for. But even the on-disk spool is bounded in size. That is, once the spool is full, it has similar effects on Filebeat as a full memory queue. The spool file only allows you to buffer more data, giving you a chance to deal with bursts of events or peak times. One still needs enough bandwidth from Beats to Logstash to drain the spool/queue eventually.

Once the queue is full, it creates back-pressure on the harvesters. If this happens, they are slowed down or blocked completely until there is new space in the queue. Depending on settings, one either has data loss, because files are closed or not picked up before they are rotated away, or one can run out of disk space, because Filebeat keeps files open that have already been deleted. The metrics contain a harvester count. This gives you some idea about open file descriptors. The harvester count keeps increasing if there is not enough bandwidth to send all logs. Right now one sees back-pressure best by the number of harvesters and open file descriptors. We're planning to add per-file metrics in order to measure the amount of back-pressure (see this issue).
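As a rough illustration of watching those two signals, the following Python sketch pulls the harvester and file-handle counters out of a parsed stats document and checks whether they trend upward over several samples. The function names and the "never decreases and ends higher" rule are my own assumptions, not something Filebeat provides; the field paths are the ones shown in the stats output earlier in this thread.

```python
def backpressure_signals(stats):
    """Extract the counters suggested above from one parsed stats snapshot."""
    harvester = stats["filebeat"]["harvester"]
    return {
        "harvesters_running": harvester["running"],
        "open_files": harvester["open_files"],
        "open_handles": stats["beat"]["handles"]["open"],
    }


def is_rising(samples, key):
    """True if `key` never decreases across samples and ends higher than it started."""
    values = [sample[key] for sample in samples]
    if len(values) < 2:
        return False
    never_drops = all(later >= earlier for earlier, later in zip(values, values[1:]))
    return never_drops and values[-1] > values[0]
```

Collecting one sample per minute and alerting when is_rising(samples, "harvesters_running") holds over, say, the last ten samples would be one way to use it; the interval and window are arbitrary and should be tuned to the rotation period.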

As resources like bandwidth, memory and disk storage are bounded by physical and virtual limits, one has to drop events if not enough bandwidth is available, in order to keep disk storage and memory usage within sane bounds. This can be achieved by limiting the number of harvesters and/or forcing Filebeat to close a file after the file has been removed or renamed. We can minimise the risk of data loss with proper sizing.
E.g. some back-of-the-envelope sizing plus sand-bagging (benchmarking in order to validate assumptions/numbers is highly recommended):

  • assuming we want to buffer up to 24h (peak times or long network downtime)
  • assuming max log file is 100MB in size
  • "total log size =~ 24 * (60/20) * 100MB = 7.2GB"
  • adding filebeat meta-data => "total size =~ 1.5*7.2GB = 10.8GB"
  • "total required bandwidth =~ 10.8GB / 24h =~ 450MB/h = 7.5MB/min = 125KB/s"
  • choose spool file between 15GB and 20GB, due to some sand-bagging for spool file + assuming 24h+ downtime, plus time to catch up (we try to be somewhat generous here)
  • assuming we have 24h of data in our spool and we want to catch up at 4 times the speed logs are written (while still writing live logs), then you need at least (4 + 1) * 125KB/s = 625KB/s of available bandwidth to catch up. The faster you can catch up, the better.
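The sizing steps above can be redone for other numbers with a small Python sketch. The function and parameter names are made up for illustration; the 1.5 overhead factor and the catch-up factor of 4 are the assumptions from the list, and units are decimal (1GB = 1000MB, 1MB = 1000KB) as in the list.

```python
def size_spool(buffer_hours, rotation_minutes, max_file_mb,
               overhead=1.5, catch_up_factor=4):
    """Back-of-the-envelope spool sizing; all inputs are assumptions to validate."""
    files = buffer_hours * 60 / rotation_minutes       # rotated files in the window
    raw_mb = files * max_file_mb                       # worst-case raw log volume
    total_mb = raw_mb * overhead                       # plus Filebeat meta-data
    live_kb_s = total_mb * 1000 / (buffer_hours * 3600)  # steady-state bandwidth
    # Catching up at N times the write speed while live logs keep arriving.
    catch_up_kb_s = (catch_up_factor + 1) * live_kb_s
    return {
        "raw_gb": raw_mb / 1000,
        "total_gb": total_mb / 1000,
        "live_kb_s": live_kb_s,
        "catch_up_kb_s": catch_up_kb_s,
    }


if __name__ == "__main__":
    # 24h buffer, rotation every 20 minutes, 100MB maximum file size.
    print(size_spool(buffer_hours=24, rotation_minutes=20, max_file_mb=100))
```

With the inputs from the list this gives 7.2GB raw, 10.8GB including meta-data, 125KB/s steady-state bandwidth, and 625KB/s for catching up. The spool file itself should still be chosen larger than the computed total, as described above.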

Note: I'm thinking about adding tools to filebeat to inspect a spool file and even publish a spool file from CLI, without actually running filebeat. As spool files can be easily copied to another machine (format is machine independent), this can be used to copy a spool file to another computer for ingestion and offline inspection.

I'd say setting a harvester_limit is a good option. If you have more harvesters, they all compete for the available bandwidth. With fewer concurrent harvesters running, each of them has more available bandwidth and a better chance to complete the current file. In case you basically have one log which acts as a kind of queue, you can even try to set harvester_limit: 1. This way the events of the current log file will mostly be sent in order (not interleaved with other files), so you have an idea about latency/delay in sending.
If the harvester limit is small and a file is removed before it is picked up by Filebeat, then this file is lost.

So in order not to lose complete files on rotation, also set close_removed: true. This will close the file once Filebeat detects that the file has been deleted. Unprocessed contents will be lost, but at least not complete files.

Instead of just losing arbitrary logs, you can make use of the drop_event processor and exclude_lines to reduce the number of events to be sent. We might eventually introduce QoS rules like rate limiting and sampling in the future.