We have a cluster configured with 5 Logstash servers and 30 Elasticsearch servers. A single host ships logs from a single file across the 5 Logstash servers in a load-balanced configuration using the Filebeat Logstash output (loadbalance: true, worker: 2).
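For reference, the output side of our filebeat.yml looks roughly like this (hostnames and the port are placeholders, not our real ones):

```yaml
output.logstash:
  # one entry per Logstash server; 5044 is just the conventional Beats port
  hosts: ["logstash1:5044", "logstash2:5044", "logstash3:5044", "logstash4:5044", "logstash5:5044"]
  loadbalance: true   # spread batches across all 5 hosts instead of sticking to one
  worker: 2           # workers per configured host
```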
When this was originally configured on ELK 6.2 we were seeing an ingest rate of about 96k events per second at peak. In recent weeks log volume started to climb and our stack began falling behind. We upgraded Filebeat, Logstash, Elasticsearch, and Kibana to the latest 6.x versions and noticed the ingest rate drop to around 60k events/second.
Now the odd thing is that because our stack can't keep up, log rotation has started to come into play: Filebeat falls far enough behind that it is often reading from the current log file as well as 1 or 2 rotated log files before they are compressed and archived.
While reading from just the one file we get the aforementioned 60k/sec ingest rate, but when the logs rotate and Filebeat is reading from 2 or more files at a time, the rate jumps to 80k+/sec. All of the files live on the same physical partition (AWS NVMe), so I don't think we're hitting an IOPS limit.
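For context, the input side is a single log input with a glob that also picks up the rotated (not yet compressed) copies, roughly along these lines (paths are placeholders):

```yaml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/app/app.log*   # placeholder; matches the live file plus rotated copies
    exclude_files: ['\.gz$']    # skip rotated files once they have been compressed
```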
I've tried playing with various settings (see the config sketch after this list), including:
- queue.mem.events
- queue.mem.flush.min_events
- queue.mem.flush.timeout
- bulk_max_size
- pipelining
- compression_level
- worker
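For reference, here's roughly where those knobs live in filebeat.yml; the values shown are illustrative placeholders, not our actual settings:

```yaml
# internal memory queue (top level of filebeat.yml)
queue.mem:
  events: 65536            # placeholder value
  flush.min_events: 2048   # placeholder value
  flush.timeout: 1s        # placeholder value

output.logstash:
  # hosts / loadbalance as shown earlier
  worker: 4                # placeholder value
  bulk_max_size: 4096      # placeholder value
  pipelining: 2            # placeholder value
  compression_level: 3     # placeholder value
```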
But none of these have improved on the 60k/sec ingest rate at all. We know the stack itself can handle a higher rate, since it keeps up just fine when multiple files are being read.
Because of this problem we're looking at moving away from Filebeat to a direct syslog->Logstash flow. But in the meantime I'd really like to find a way to match the performance we see when reading from multiple files vs. one file. Any thoughts on why we'd see a roughly 30% gain in throughput when reading from more than one file?