Insufficient throughput from Filebeat

We recently deployed Filebeat as part of an ELK deployment in our production environment. Generally, things have gone well, but Filebeat appears to be having a very difficult time keeping up with some of our highest-volume point sources. I am wondering if anyone has advice on how to increase the throughput.

Some background:

  • We are running Filebeat 1.0.1 on CentOS 6.6.
  • We are running Filebeat on about 620 hosts, load-balancing into the Logstash cluster.
  • We are running Logstash 2.1.1 on CentOS 6.6 (with logstash-input-beats v2.1.2 and logstash-output-elasticsearch v2.3.0).
  • We are pushing logs into a four-node Logstash cluster. Each node in the cluster is a virtual machine with 8 CPUs and 4 GB RAM.
  • We are indexing into a hybrid virtual/physical Elasticsearch cluster, with client and master nodes running on virtual machines and data nodes running on physical machines.

Our workloads are highly cyclic, with daytime peaks many times larger than the quieter nighttime hours. During these peaks, the highest-volume point-source logs may emit ~3500 lines per second. In the current phase of integration, there are about 8 hosts that fall into this category.

During these high-volume periods we begin to fall behind, and it would certainly appear that some combination of Filebeat and the Filebeat input plugin for Logstash is to blame. I feel fairly confident that this is the case because our Logstash processing hosts are barely working, running at around 20% CPU. Stack traces taken on the Logstash process consistently show that most threads are waiting around for something to happen. It is rarely the case that I will catch one working at all.

I have already taken a look at this issue and applied the related change of bumping spool_size and bulk_max_size to 2048. The change seems to have increased throughput a bit (maybe 30%), but not enough to make a huge difference.
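For reference, the change amounts to something like this (a sketch with a placeholder host list; my full config is further down the thread):

filebeat:
  spool_size: 2048          # events buffered before the spooler flushes to the output
output:
  logstash:
    hosts: ['logproc001']   # placeholder; the real host list is in the full config below
    bulk_max_size: 2048     # max events per batch sent to the beats input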

If anyone would like additional detail, please let me know. In the meantime, I will continue investigating, and include any additional details on this thread.

Have you checked CPU/memory usage on some of your Filebeat hosts? Have you checked, e.g. with iotop, for disk-based slowdowns on some of your servers (e.g. very long wait times)? Have you checked whether you're experiencing network congestion?

CPU and memory appear to be healthy across the board. The CPUs on both the Filebeat and Logstash sides are generally quite idle (80-90% idle), with sufficient process and overall system memory available.

Network usage is well within the norms for the hosts (single or low double-digit Mb/s). Further, the overall network is in good shape.

Disk activity isn't particularly interesting. There is a decent number of writes taking place (including some from filebeat -- the registry file?), but essentially no reads. This is likely because most of the reads are hitting the OS file cache. Overall, the disk service times seem pretty normal.

Depending on your RAM, maybe increase harvester_buffer_size?

Have you tried different combinations of hosts, worker, and loadbalance: true?
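Something along these lines is what I mean (just a sketch; option names as I understand the 1.x logstash output docs, so double-check worker in particular):

output:
  logstash:
    hosts: ['logproc001', 'logproc002', 'logproc003', 'logproc004']
    loadbalance: true   # spread batches across all listed hosts
    worker: 2           # publisher workers per configured host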

I also noticed it is kind of slow. I wondered whether adding multiple beats inputs on each Logstash server would increase throughput. I guess you would have to test to find out, but just some thoughts!

Also, my understanding of harvester_buffer_size is that it controls how much of the log file is read into memory before being pushed out. If you have it set to, say, 16,384 (which is the default), then that divided by 4 hosts is 4,096 per host. Maybe adjust spool_size to 4096 so it's not having to empty the spool twice for every harvest?
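In config terms, roughly this (just a sketch to illustrate the idea; as far as I can tell, harvester_buffer_size is a per-harvester read buffer in bytes, so the division above is only a rough heuristic):

filebeat:
  spool_size: 4096                           # flush the spool in larger batches
  prospectors:
    - paths:
        - /var/log/haproxy/haproxy-http.log  # example path from earlier in the thread
      harvester_buffer_size: 16384           # per-harvester read buffer in bytes (the default)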

Also, in the Logstash logs, are you seeing any pipeline blocked messages? I think that's usually a sign of Logstash backpressure. If not, I think you could push a larger spool_size?

Hi,

Are you using network shares, or are these files all local? Have you checked that all files are actually being forwarded? Can you share your Filebeat config file?

I'd like to run some tests first. Unfortunately, you will need 1.1 for these tests, as we have started to insert some counters in critical places.

https://download.elastic.co/beats/filebeat/filebeat_1.1.0-SNAPSHOT_amd64.deb
https://download.elastic.co/beats/filebeat/filebeat-1.1.0-SNAPSHOT-x86_64.tar.gz
https://download.elastic.co/beats/filebeat/filebeat_1.1.0-SNAPSHOT_i386.deb
https://download.elastic.co/beats/filebeat/filebeat-1.1.0-SNAPSHOT-x86_64.rpm
https://download.elastic.co/beats/filebeat/filebeat-1.1.0-SNAPSHOT-darwin.tgz
https://download.elastic.co/beats/filebeat/filebeat-1.1.0-SNAPSHOT-i686.tar.gz
https://download.elastic.co/beats/filebeat/filebeat-1.1.0-SNAPSHOT-i686.rpm
https://download.elastic.co/beats/filebeat/filebeat-1.1.0-SNAPSHOT-windows.zip

Here is a python script to collect these counters and compute rates:

import argparse
import curses
import time

import requests

# 'long' only exists on Python 2; fall back to (int, float) on Python 3.
try:
    NUMERIC_TYPES = (int, long, float)
except NameError:
    NUMERIC_TYPES = (int, float)


def main():
    parser = argparse.ArgumentParser(
        description="Print per second stats from expvars")
    parser.add_argument("url",
                        help="The URL from where to read the values")
    args = parser.parse_args()

    # Use curses so the counters are redrawn in place once per second.
    stdscr = curses.initscr()
    curses.noecho()
    curses.cbreak()

    last_vals = {}

    # running average over the last 30 measurements
    N = 30
    avg_vals = {}
    now = time.time()

    try:
        while True:
            try:
                time.sleep(1.0)
                stdscr.erase()

                r = requests.get(args.url)
                data = r.json()

                last = now
                now = time.time()
                dt = now - last

                for key, total in data.items():
                    if isinstance(total, NUMERIC_TYPES):
                        if key in last_vals:
                            # rate since the previous sample
                            per_sec = (total - last_vals[key]) / dt
                            if key not in avg_vals:
                                avg_vals[key] = []
                            avg_vals[key].append(per_sec)
                            if len(avg_vals[key]) > N:
                                avg_vals[key] = avg_vals[key][1:]
                            avg_sec = sum(avg_vals[key]) / len(avg_vals[key])
                        else:
                            per_sec = "na"
                            avg_sec = "na"
                        last_vals[key] = total
                        stdscr.addstr("{}: {}/s (avg: {}/s) (total: {})\n"
                                      .format(key, per_sec, avg_sec, total))
                stdscr.refresh()
            except requests.ConnectionError:
                # The beat is not up yet (or restarted); reset and keep polling.
                stdscr.addstr("Waiting for connection...\n")
                stdscr.refresh()
                last_vals = {}
                avg_vals = {}
    finally:
        # Restore normal terminal behaviour, e.g. after Ctrl-C.
        curses.nocbreak()
        curses.echo()
        curses.endwin()


if __name__ == "__main__":
    main()

The counters are queried via HTTP. I assume you don't want to have an open port, so we make the beat bind to localhost only (the Python script must be run locally in this case).

run.sh:

#!/bin/sh
# Remove the local registry so filebeat re-reads the files from the start,
# then run filebeat with the config file passed as the first argument.
REGISTRY=.filebeat
rm -f $REGISTRY
filebeat -e -httpprof 127.0.0.1:6060 -c "$1"

The -httpprof 127.0.0.1:6060 flag starts a small HTTP server that exposes some counters. The script expects the configuration file to run Filebeat with as its first argument.

Start the python script like this (can be active all the time):

$ python expvar_rates.py http://localhost:6060/debug/vars

There are quite a few variables in here. I'd like to reduce them first to get an idea of the I/O. To do so, take your config file and comment out the logstash output plugin; enable the console output plugin instead. Also, reconfigure the registry path (and adapt the shell script) so you don't overwrite your actual registry.
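For illustration, a throwaway test config could look roughly like this (a sketch; the registry and prospector paths are placeholders to adapt, and REGISTRY in run.sh must point at the same file):

# test.yml -- I/O-only test: console output instead of logstash, separate registry
filebeat:
  registry_file: /tmp/filebeat-test-registry   # placeholder; keeps the real registry untouched
  prospectors:
    - document_type: haproxy-http
      paths:
        - /var/log/haproxy/haproxy-http.log    # placeholder path
output:
  console:
    pretty: false                              # plain JSON lines to stdout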

With console output enabled, it's a good idea to call run.sh like this:

$ ./run.sh test.yml > /dev/null

Consider doing the test (without redirecting to /dev/null) with only one prospector configured at a time, to rule out log files being missed due to a faulty config.

Doing this test with the NASA HTTP logs, I get about 75k lines/s with the default configuration.

On some hosts, the files are local on SSD. On others, they are being served (along with everything else on the VM) from remote storage (gluster). That said, there is no current evidence that I/O latency is the dominant factor here; the I/O measurements all seem snappy.

My Filebeat configuration is broken into several files, with prospectors defined as contributions in a conf.d-style arrangement. A slightly munged sample is below.

/etc/filebeat/filebeat.yml
---
filebeat:
  config_dir: /etc/filebeat/conf.d
  idle_timeout: 5s
  registry_file: /var/lib/filebeat/registry
  spool_size: 2048
logging:
  files:
    keepfiles: 5
    name: filebeat.log
    path: /var/log/filebeat
    rotateeverybytes: 10485760
  level: info
  to_files: true
  to_syslog: false
output:
  logstash:
    bulk_max_size: 2048
    hosts: ['logproc001','logproc002','logproc003','logproc004']
    loadbalance: false
    max_retries: -1

The prospector contribution file for haproxy:

---
filebeat: 
  prospectors: 
  - 
    document_type: haproxy-http
    paths: 
      - /var/log/haproxy/haproxy-http.log
  - 
    document_type: haproxy-tcp
    paths:
      - /var/log/haproxy/haproxy-tcp.log
  - 
    document_type: haproxy-notice
    paths: 
      - /var/log/haproxy/haproxy-notice.log

I will make an effort to run your proposed test within the next day or two. I would very much like to collect numbers from a relatively quiet period, as well as one during peak daily demand. Please stay tuned for those results.

One final aside... I did perform another experiment last evening, disabling the 'loadbalance' feature (the configs above include that change). That single change appears to have alleviated the most acute symptoms of the problem; loads that were causing significant backups during the last several days were not an issue today, and I see significantly higher peak throughput on our busiest point sources.

I can only speculate on why this might be the case, but it may point toward some issue with Filebeat's load-balancing mechanism, or some issue on the input plugin side when dealing with a larger number of incoming sockets. If this tidbit is something you think we could test using your methodology above, please let me know. I could potentially collect the metrics with and without loadbalance enabled.

Some results from my development VM, using the 64-bit RPM linked above.

Filebeat configuration:

---
filebeat: 
  config_dir: /etc/filebeat/conf.d
output: 
  logstash: 
    hosts: ['localhost']

...and...

filebeat: 
  prospectors: 
  - 
    document_type: pmta-acct-csv
    paths: 
      - /var/log/pmta/*.csv

Logstash configuration:

input {
  beats {
    port => 10200
    tags => []
  }
}
output {
  null {
  }
}

The test used a CSV file of 1 million lines, with a total size of 278 MB.

Filebeat and Logstash were run on the same 4-CPU VM. The Python process was run remotely.

'top' sample:

top - 15:10:54 up 11 days,  5:33,  2 users,  load average: 0.85, 0.45, 0.63
Tasks: 226 total,   2 running, 224 sleeping,   0 stopped,   0 zombie
Cpu(s):  9.6%us,  2.3%sy, 19.1%ni, 68.6%id,  0.1%wa,  0.0%hi,  0.0%si,  0.3%st
Mem:  16269360k total,  3455504k used, 12813856k free,   268964k buffers
Swap:   786428k total,       52k used,   786376k free,  2145084k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                      
20090 logstash  39  19 5067m 386m  14m R 71.8  2.4   0:59.31 [beats-input]>c                                                                                               
17242 logstash  39  19 5067m 386m  14m S 11.3  2.4   1:58.21 [beats-input]-b                                                                                               
17241 logstash  39  19 5067m 386m  14m S  8.6  2.4   1:42.21 >output                                                                                                       
20082 tom.dewi  20   0  457m  25m 4752 S  8.3  0.2   0:04.12 filebeat                                                                                                      
20097 tom.dewi  20   0  457m  25m 4752 S  7.0  0.2   0:02.68 filebeat                                                                                                      
20092 tom.dewi  20   0  457m  25m 4752 S  6.6  0.2   0:04.31 filebeat                                                                                                      
20087 tom.dewi  20   0  457m  25m 4752 S  6.0  0.2   0:04.08 filebeat                                                                                                      

iostat sample:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.03   20.26    1.70    0.03    0.37   68.61

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvda              0.00     0.00    0.00    1.60     0.00    12.80     8.00     0.00    1.62   0.62   0.10
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.10     0.00     0.80     8.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    1.50     0.00    12.00     8.00     0.00    1.73   0.67   0.10
dm-3              0.00     0.00    0.00    1.50     0.00    12.00     8.00     0.00    1.73   0.67   0.10
dm-4              0.00     0.00    0.00    1.50     0.00    12.00     8.00     0.00    1.73   0.67   0.10
dm-5              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

There are no disk reads taking place. It would appear the entire file is in the file cache.

Filebeat metrics @ about halfway:

libbeatMessagesDropped: 0.0/s (avg: 0.0/s) (total: 0)
libbeatEsPublishEventsCallCount: 0.0/s (avg: 0.0/s) (total: 0)
libbeatLogstashPublishedAndAckedEvents: 2580.31814814/s (avg: 2490.59748679/s) (total: 448512)
libbeatLogstashPublishedButNotAckedEvents: 0.0/s (avg: 0.0/s) (total: 0)
libbeatEsPublishedButNotAckedEvents: 0.0/s (avg: 0.0/s) (total: 0)
libbeatPublishedEvents: 2580.31814814/s (avg: 2490.59748679/s) (total: 450560)
libbeatLogstashPublishEventsCallCount: 1.25992097077/s (avg: 1.21573430878/s) (total: 220)
libbeatMessagesInWorkerQueues: 0.0/s (avg: 0.0/s) (total: 0)
libbeatEsPublishedAndAckedEvents: 0.0/s (avg: 0.0/s) (total: 0)

During several iterations of the test, the "published events" rate never got much above ~2500 events/second.

As I mentioned, I will attempt to move this test out to production hardware early next week. If there are any other configuration tweaks you would like me to try, please let me know.

I should add that in a later iteration, I moved the registry file to a RAM disk to eliminate the possibility that writing it was causing a hidden delay. There was no difference in overall throughput.
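For anyone repeating this, the change was just pointing registry_file at a tmpfs-backed path, roughly like this (the location here is only an example):

filebeat:
  registry_file: /dev/shm/filebeat-registry   # example tmpfs-backed location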

Do you have something like 3 Logstash instances? Is the [beats-input]>c entry the Logstash process dealing with input from Filebeat?

The Filebeat processes do not seem to be very busy, likely due to back-pressure from Logstash. I wonder how much slowdown is caused by running everything in a VM; I will do some tests in a VM when I find time. The best option in 1.0.1 and 1.1 is to set spool_size and bulk_max_size to higher values and see if there is a difference. In our development branch we've added an option to disable compression and improved load-balancing support for Filebeat.
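For reference, that tuning looks roughly like this (a sketch; the compression knob exists only in the development branch, and I'm assuming here that it lands as compression_level):

filebeat:
  spool_size: 4096
output:
  logstash:
    bulk_max_size: 4096
    # compression_level: 0   # development-branch option to disable compression (name assumed)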

That was the 'threads' view in top (top -H), so it was showing the activity of individual threads. There was only one instance each of Filebeat and Logstash.

Time permitting, I will attempt some additional scenarios (higher spool/bulk, running on physical) to see if the results change.

Thanks for all of the suggestions so far.

I ran the same test this morning on a significantly more powerful physical host. This resulted in significantly higher throughput, as you can see in the results below.

It seems likely that the difference in throughput can be explained by some combination of virtualization and core count (4 vs. 32). I will continue to run experiments to see if I can learn more. For instance, I am interested in which side of the pipeline is most affected, Filebeat or Logstash. I will also see if I can get better results on a virtualized instance with a higher core count.

Results of physical to physical, using configuration as shown in my previous test:

libbeatMessagesDropped: 0.0/s (avg: 0.0/s) (total: 0)
libbeatEsPublishEventsCallCount: 0.0/s (avg: 0.0/s) (total: 0)
libbeatLogstashPublishedAndAckedEvents: 15238.5461335/s (avg: 10872.1883083/s) (total: 550912)
libbeatLogstashPublishedButNotAckedEvents: 0.0/s (avg: 0.0/s) (total: 0)
libbeatEsPublishedButNotAckedEvents: 0.0/s (avg: 0.0/s) (total: 0)
libbeatPublishedEvents: 15238.5461335/s (avg: 10872.1883083/s) (total: 552960)
libbeatLogstashPublishEventsCallCount: 7.44069635423/s (avg: 5.30851850752/s) (total: 270)
libbeatMessagesInWorkerQueues: 0.0/s (avg: 0.0/s) (total: 0)
libbeatEsPublishedAndAckedEvents: 0.0/s (avg: 0.0/s) (total: 0)

machine specs?

I tested on a physical 4-core i7 3.4 GHz machine (with hyperthreading enabled, so 8 logical cores) and 32 GB RAM. Measured throughput was about 18k lines/s.

For the virtual:

Linux xxxx 2.6.32-504.23.4.el6.x86_64 #1 SMP Tue Jun 9 20:57:37 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
4x Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
16 GB RAM

For the physical:

Linux xxxx 2.6.32-504.23.4.el6.x86_64 #1 SMP Tue Jun 9 20:57:37 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
32x Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz
64 GB RAM

I am assuming that both of these are running with hyperthreading, so the real core count is half that.

I really think there is a slowdown on the Logstash input side. I have created this issue: https://github.com/logstash-plugins/logstash-input-beats/issues/45

I tend to agree, and I'll be following that bug. None of the experiments I have run over the last couple of days are significantly different from what is shown in your linked issue.

If something does change, I will update this thread.

@steffens:
This Python script has been really helpful for benchmarking Filebeat. It should somehow be captured or merged into the git repo.

Thoughts?

While nothing is in development yet, we're thinking about how monitoring of the beats themselves can be improved. I don't think the current solution of exposing the counters as-is plus a script is where we're heading. Right now, the monitoring support adds a profiling endpoint that would allow anyone to manipulate your beat if it were accessible. The script is more of a hack right now, and variations of it should not be used in production.

For now I created a gist: https://gist.github.com/urso/de920f9b3d54c2c48ed5278906765066

The script was very useful for me too.