We recently rolled out Filebeat as part of an ELK deployment in our production environment. Generally, things have gone well, but Filebeat appears to be having a very difficult time keeping up with some of our highest-volume point-source logs. I am wondering if anyone has advice on how to increase throughput.
Some background:
- We are running Filebeat 1.0.1 on CentOS 6.6.
- We are running Filebeat on about 620 hosts, load-balancing into the Logstash cluster.
- We are running Logstash 2.1.1 on CentOS 6.6 (with logstash-input-beats v2.1.2 and logstash-output-elasticsearch v2.3.0); a simplified pipeline config is sketched below this list.
- We are pushing logs into a four-node Logstash cluster. Each node in the cluster is a virtual machine with 8 CPUs and 4 GB RAM.
- We are indexing into a hybrid virtual/physical ES cluster, with client and master nodes running on virtual machines and data nodes running on physical hardware.
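For reference, the Logstash pipeline on each node boils down to a beats input feeding an elasticsearch output. The sketch below is simplified: the port and hostnames are placeholders, and our filter stages are omitted.

```
input {
  beats {
    port => 5044
  }
}

output {
  elasticsearch {
    hosts => ["es-client-1:9200", "es-client-2:9200"]
  }
}
```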
Our workloads are highly cyclic, with daytime peaks many times larger than the quieter nighttime hours. During these peaks, the highest-volume point-source logs may emit ~3500 lines per second. In the current phase of integration, about 8 hosts fall into this category.
During these high-volume periods we begin to fall behind, and it certainly appears that some combination of Filebeat and the Filebeat input plugin for Logstash is to blame. I feel fairly confident about this because our Logstash processing hosts are barely working, running at around 20% CPU. Stack traces taken on the Logstash process consistently show most threads waiting around for something to happen; I rarely catch one doing any work at all.
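For reference, the stack traces were captured roughly like this (assuming a single Logstash JVM per host and the JDK's jstack on the PATH):

```
# adjust the pgrep pattern if it matches more than the Logstash JVM
jstack -l $(pgrep -f logstash) > /tmp/logstash-threads.txt
```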
I have already taken a look at this issue and applied the related change of bumping spool_size and bulk_max_size to 2048. That change seems to have increased throughput a bit (maybe 30%), but not enough to make a real difference.
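For concreteness, the relevant parts of our filebeat.yml now look roughly like this (the prospector path and Logstash hostnames are placeholders, not our real values):

```yaml
filebeat:
  spool_size: 2048               # bumped per the issue above
  prospectors:
    - input_type: log
      paths:
        - /var/log/myapp/*.log   # placeholder path

output:
  logstash:
    hosts: ["ls-1:5044", "ls-2:5044", "ls-3:5044", "ls-4:5044"]
    loadbalance: true            # spread events across the four Logstash nodes
    bulk_max_size: 2048          # bumped alongside spool_size
```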
If anyone would like additional detail, please let me know. In the meantime, I will continue investigating and will post anything I find to this thread.