Filebeat performance - How to check if Filebeat is keeping up while reading files

We have an ELK 5.5.2 environment with 3 master-eligible nodes, 3 coordinating nodes and 8 data nodes. The servers in this environment have 8 cores, 64 GB of RAM and 200 GB of storage on the data nodes.

We are sending the syslogs for the ELK servers to a remote rsyslog receiving Linux server where we have Filebeat installed. This server is not part of the ELK environment - it is a specialized server for collecting syslogs from other Linux servers. Filebeat is forwarding the syslogs and its own logs to the ELK environment directly: FB -> ES.

When we query the data in Kibana we are seeing a lag of a few minutes. In other words, we have no data for the last x minutes from the server where we have Filebeat installed.

How do we tell if Filebeat is not keeping up, i.e. if it is falling behind while reading the logs? Is there a way to have Filebeat print out the position it is currently reading from and the position of the EOF?

We don't report the last offsets and file sizes as metrics yet. Feel free to open an enhancement request.

Filebeat writes the last ACKed file states into the registry file. The file contains all states as a JSON array. You can open and check the file at any time.
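
For example, assuming the default registry location of the Linux packages (/var/lib/filebeat/registry) and that the jq tool is available on the host, a rough sketch for comparing the recorded offsets against the current file sizes could look like this (the path and tooling are assumptions about your setup, not something Filebeat provides itself):

# Sketch: compare the last ACKed offset per file with the file's current size.
# /var/lib/filebeat/registry and jq are assumptions about your environment.
REGISTRY=/var/lib/filebeat/registry
jq -r '.[] | "\(.source) \(.offset)"' "$REGISTRY" | while read -r source offset; do
  size=$(stat -c %s "$source" 2>/dev/null || echo "?")
  echo "$source  offset=$offset  size=$size"
done

If offset keeps falling further behind size over time, Filebeat is not keeping up with that file.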

Keep in mind it's not only Filebeat, but also Elasticsearch, that determines the maximum throughput. To get some idea how fast Filebeat can process your log files, you can test Filebeat with console output like this (delete the registry file between runs):

filebeat -E output.elasticsearch.enabled=false -E output.console.enabled=true | pv -Warl >/dev/null

The pv tool will print the throughput in number of lines (number of events) to stderr. As live logs are likely to be in the FS cache (still in main memory), you might want to run the tests multiple times and compare the number of input IO operations (use the time command).
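
A single benchmark run with timing might look like the sketch below; the registry path is an assumption (default location of the Linux packages), and you stop the pipeline with Ctrl+C once pv's average rate has stabilised:

# Sketch: re-read the files from the beginning and time the run.
rm -f /var/lib/filebeat/registry   # assumed default registry path; adjust to your setup
time filebeat -E output.elasticsearch.enabled=false \
              -E output.console.enabled=true | pv -Warl > /dev/null
# Interrupt with Ctrl+C once the average rate settles; repeat a few times so the
# files are served from the FS cache and the runs become comparable.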

Next, have a look at the indexing rates in Elasticsearch (e.g. via Monitoring or the stats API). You already have some historical data, so there is no need to run the tests again.
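
If you don't have Monitoring set up, the stats APIs expose the same raw counters. A minimal sketch, assuming Elasticsearch is reachable on localhost:9200 and the indices follow the default filebeat-* naming:

# Per-index indexing counters (index_total, index_time_in_millis, ...):
curl -s 'http://localhost:9200/filebeat-*/_stats/indexing?pretty'
# Node-level indices and thread pool stats (look for bulk queue sizes and rejections):
curl -s 'http://localhost:9200/_nodes/stats/indices,thread_pool?pretty'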

How did you configure the Elasticsearch output in Beats? Which nodes do you send the events to? Have you adapted the bulk_max_size setting? The default of 50 events per batch is quite small and really only suffices for very small deployments. Check if you can ramp up bulk_max_size without noticing any errors or regressions in throughput.
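
For reference, a sketch of what raising the batch size could look like; the hostnames and the value 2048 below are placeholders to experiment with, not recommendations:

# Quick test via a command-line override:
filebeat -E output.elasticsearch.bulk_max_size=2048
# Or persistently in filebeat.yml (hosts are placeholders for your cluster):
# output.elasticsearch:
#   hosts: ["node1:9200", "node2:9200", "node3:9200"]
#   bulk_max_size: 2048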

I see the following entries in the registry file for our file:
{"source":"/opt/pki/syslog/messages","offset":20548686,"FileStateOS":{"inode":134305414,"device":64781},"timestamp":"2017-10-06T10:31:14.414681177-04:00","ttl":-1},
{"source":"/opt/pki/syslog/messages","offset":29649281,"FileStateOS":{"inode":134305414,"device":64786},"timestamp":"2017-10-05T14:38:04.750864849-04:00","ttl":-1},
{"source":"/opt/pki/syslog/messages","offset":30629013,"FileStateOS":{"inode":134305414,"device":64782},"timestamp":"2017-10-05T14:38:04.750865654-04:00","ttl":-1},

It appears that the offset field is the one I am interested in. It has values of

  • 20548686
  • 29649281
  • 30629013
How does that relate to the size field from ls -ld?
-rw-rw-r-- 1 root root 20933631 Oct 6 10:34 messages
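
If the offset really is the last ACKed byte position (per the earlier reply), a rough backlog estimate for the entry that appears to match the live file (offset 20548686, recorded a few minutes before the ls above) would be:

# Rough backlog estimate: current file size minus last ACKed offset.
# Assumes the first registry entry is the one for the live file; the other two
# entries have different device ids and likely refer to other copies of the path.
echo $(( 20933631 - 20548686 ))   # => 384945 bytes not yet shipped at that point in time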

We configured Filebeat to send the log entries to the three master-eligible nodes.

We are using the default Filebeat configuration.

For this command:
filebeat -E output.elasticsearch.enabled=false -E output.console.enabled=true | pv -Warl >/dev/null
we are seeing the following output. The documentation for the pv command is quite sparse. If I understand the documentation correctly, the first value is the current rate in lines/second and the second value is the average rate in lines/second.

  • [0.00 /s] [4.98k/s]
  • [10.6 /s] [4.13k/s]
  • [9.54 /s] [3.62k/s]
  • [0.00 /s] [3.37k/s]
  • [23.3 /s] [2.89k/s]
Is an average rate of 2.89k/s to 4.98k/s a low throughput for Filebeat?

On additional runs I am seeing a much higher average rate. The throughput seems quite variable.

  • [18.4k/s] [29.0k/s]
  • [0.00 /s] [24.3k/s]
  • [0.00 /s] [20.9k/s]
  • [0.00 /s] [18.3k/s]
  • [0.00 /s] [16.3k/s]
  • [1.08k/s] [14.8k/s]
  • [0.00 /s] [12.3k/s]
  • [7.74 /s] [7.40k/s]
  • [19.3 /s] [4.24k/s]
  • [8.71 /s] [3.71k/s]

We are not sure if these rates are high or low for filebeat.

After removing the registry file, the environment has stopped lagging. We are now seeing logs from the last 30 seconds.

Did you run your tests with a pre-existing log file or did you rely on applications writing logs? When testing, try to have as little variability in your environment as possible. In case you're using a pre-existing file, something is really off here; the deviation in rates is much too big...

We have about 19 Red Hat Linux 7.2 servers sending their syslogs to a load balancer. The load balancer distributes the syslogs to 3 Red Hat Linux 7.2 servers.

The 3 servers behind the load balancer send the logs to Elastic via Filebeat. The rate can vary depending on what is going on in the 19 servers and how the load balancer distributes the load.
