Well, I've found this, this, and this to be relevant enough. Still no real success, unfortunately.
I tried to find the bottleneck by measuring throughput at different stages of the "pipeline":
- Filebeat alone (using "output.console") is able to read my logs locally at a 25k/s rate.
- When I use a remote output, no matter whether it's Logstash (even with all the filters commented out) or Elasticsearch, the rate drops to the same 4k-5k/s. When I say "Filebeat with Logstash output" here, I mean Logstash with only the "stdout" output plugin, measuring its throughput both with "pv" and with Filebeat's internal monitoring (a rough sketch of the Filebeat-only measurement is right below).
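To be clear about the measurement itself, this is roughly what the Filebeat-only test looked like (the flags and the config file name here are illustrative, not copied from my shell history):

    # filebeat-console.yml is my normal config with the output replaced by:
    #   output.console:
    #     enabled: true
    # "-e" sends Filebeat's own logs to stderr, so stdout carries only events;
    # pv in line mode then shows events (lines) per second going through the pipe
    filebeat -e -c filebeat-console.yml | pv -l -r > /dev/null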
I also tried using a "dummy log" as a data source (a randomly generated file with lines much shorter than in my original files): Filebeat alone processed it at a 45k/s rate, but with remote Logstash/Elasticsearch the rate only went up to 6k/s.
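The dummy file was nothing special, just a lot of short lines; I no longer have the exact generator, but something along these lines reproduces it:

    # generate ~1M short fake log lines (illustrative, not my exact generator)
    for i in $(seq 1 1000000); do
      echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) INFO dummy message $i $RANDOM"
    done > /tmp/dummy.log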
Since the main idea of the links I mentioned relates to bulk/batch sizes and worker counts (both for Filebeat and Logstash), I've also tried playing with those.
Right now, my Logstash has this in its pipelines.yml:
- pipeline.id: "pipeline1"
  path.config: "/usr/share/logstash/pipeline/pipeline1"
  pipeline.ordered: false
  pipeline.workers: 96
  pipeline.batch.size: 1024
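(For reference, the per-pipeline settings a running Logstash actually uses can be checked through its node info API; a minimal example, assuming the default API port 9600:)

    # shows workers, batch_size, batch_delay per pipeline as Logstash sees them
    curl -s 'localhost:9600/_node/pipelines?pretty'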
And filebeat.yml:
queue.mem:
  events: 600000
  flush.min_events: 512
  flush.timeout: 5s
# ...
output:
  logstash:
    loadbalance: true
    workers: 12
    pipelining: 4
    bulk_max_size: 1024
    hosts:
      - <20 different ports of my logstash here>
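(As a side note on the "Filebeat internal monitoring" I keep referring to: besides the periodic metrics in Filebeat's log, the output counters can be watched live through its local HTTP stats endpoint if that is enabled; a small sketch, not my exact setup:)

    # requires http.enabled: true in filebeat.yml (default port 5066);
    # the libbeat output section shows acked/failed event counters
    curl -s 'localhost:5066/stats?pretty'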
By my logic, this should be much more than enough - but no. Looking at the Logstash flow stats:
...
"worker_concurrency" : {
"current" : 8.992,
"last_1_minute" : 5.544,
"last_5_minutes" : 5.699,
"last_15_minutes" : 7.645,
"last_1_hour" : 8.096,
"lifetime" : 8.02
}
...
- it's nowhere near the number of workers available on either side.
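(These flow metrics come from Logstash's node stats API; the call looks something like this, again assuming the default API port:)

    # per-pipeline stats, including the flow.worker_concurrency block shown above
    curl -s 'localhost:9600/_node/stats/pipelines?pretty'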
Another thing I've noticed: when I ran tests to measure rates between Filebeat and Logstash (one Filebeat instance sending to multiple ports of the same Logstash), I was able to reach something like a 5k-6k/s rate. But now, when I use multiple Filebeats, the total Logstash rate stays the same: 5k-6k/s for all Filebeats combined.
So, logically, that should mean the bottleneck is either the network or Logstash itself (or the physical node it's hosted on). But:
- The Logstash "flow stats" (mentioned above) show that the Logstash workers are far from saturated.
- I have tested the network between one Filebeat instance and the Logstash instance with iperf, and the bandwidth easily reached 500 Mbps (a sketch of that test is below).
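(The iperf test was just the standard client/server run, roughly like this; the host placeholder and duration are illustrative:)

    # on the Logstash node
    iperf -s
    # on the Filebeat node, a 30-second test towards Logstash
    iperf -c <logstash_host> -t 30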
So far, it seems to me that both Filebeat and Logstash are doing fine on their own (and Elasticsearch is out of the question, because even the "filebeat->logstash" test without ES shows poor performance), but when I try to combine them, something doesn't work out. It's as if Filebeat just doesn't want to send data fast enough, for some unknown reason.