3 Filebeat agents on 3 servers.
1 Logstash instance.
3 Elastic data nodes.
Looking through the Kibana monitoring, Logstash shows it's receiving about 1,500 events per second and emitting about 1,500 events per second, which suggests that Logstash is not the bottleneck.
But regardless of how many logs I write, I can't seem to get the 3 Filebeat agents to send more than 1,500 events per second to Logstash.
My application writes 4 log entries at 2,500 bytes total per "business" request.
First, try to get an idea of how fast Filebeat can actually read events. To do this, run Filebeat with console output only, e.g. start it in a terminal along the lines of the sketch below.
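A minimal sketch of such a run (the benchmark config file name and the use of `pv` to measure the line rate are my own assumptions, not from the original post):

```sh
# filebeat.bench.yml: a copy of the real config with output.logstash removed
# and output.console enabled, so only read/publish speed is measured.
./filebeat -e -c filebeat.bench.yml | pv -Warl > /dev/null
# pv -Warl prints the current and average lines-per-second rate while
# discarding the actual output.
```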
Also test with the most recent 6.0 release; we've made some improvements that should increase throughput.
By default, the Go runtime creates as many OS worker threads as there are CPUs available. With hyper-threading and resource contention from other processes on the system, I haven't always found this to be optimal; sometimes reducing the number of OS workers actually helps. Try running with -E max_procs=N to find a good number of OS threads.
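For example (the value 2 below is just a placeholder to experiment with, not a recommendation):

```sh
# Limit the Go runtime to 2 OS threads and compare the measured event rate
# against runs with other values of max_procs.
./filebeat -e -c filebeat.yml -E max_procs=2
```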
If the filebeat->logstash throughput is much lower than filebeat->console, the loss of event rate is due to backpressure from the network and/or Logstash.
These rates vary too much because you are benchmarking on live logs. That is, the results vary because the rate at which log lines are written varies quite a lot.
The registry file keeps track of the last lines processed. For benchmarking, you therefore want a copy of your logs and a separate registry file that you can safely delete between runs. This will also give you more consistent rates.
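One way to set this up (paths and file names are illustrative only; filebeat.registry_file is the setting name used by Filebeat 5.x/6.x):

```sh
# Benchmark against a static copy of the logs with a throwaway registry file,
# so every run starts from the same state.
mkdir -p /tmp/fb-bench
cp -r /var/log/myapp /tmp/fb-bench/logs      # hypothetical source of test logs
rm -f /tmp/fb-bench/registry                 # reset Filebeat's read offsets between runs
./filebeat -e -c filebeat.bench.yml -E filebeat.registry_file=/tmp/fb-bench/registry
```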
Did you delete the registry file between restarts? This high variance in rates looks very fishy to me.
Potential reasons for such high variance that I can think of:
processing live logs, with the log producer writing at a varying rate
some resource overload scenario, such as:
some kind of network share or network-based storage, with the network or storage under pressure
memory pressure + too much swapping
not enough CPU available for Beats due to a container or VM environment with far too many active VMs
too many processes doing disk I/O, putting the disk and file cache under pressure => run under time and check that I/O ops stay low while benchmarking; low I/O ops indicate the file under test is mostly cached (see the sketch after this list)
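One way to do that last check, assuming GNU time is available (note /usr/bin/time, not the shell builtin):

```sh
# The verbose report includes "File system inputs/outputs"; low numbers while
# benchmarking suggest the log files are being served from the page cache.
/usr/bin/time -v ./filebeat -e -c filebeat.bench.yml
```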
Huh? You have a few thousand files in that directory?
Harvesting/pushing files is done concurrently with the routine that checks for new files. If you have only 1 or 2 CPUs assigned to Filebeat, this might affect throughput.
Hi, sorry for the confusion. If you look at my original config in the posts above, I have 2 harvesters.
1st harvester: for the containers; this has the nested paths that @warkolm asked about (how many paths there were).
2nd harvester: DC/OS Mesos logs, single log file.
I stopped Filebeat and let the logs accumulate a bit; most of them were the DC/OS Mesos logs, as the container applications were not really active at that time. When I restarted Filebeat and looked at the Logstash metrics, it was able to process 3K logs per second.
So I was wondering if the slowdown for the containers is due to the nested paths that @warkolm mentioned. DC/OS and Mesos create new folders every time we launch a new container or restart an existing one.
So far I don't think it's Logstash or Elasticsearch, as the events received and emitted by Logstash are on par with each other and match.
Unless you have a few hundred paths/sub-paths (also accounting for files in sub-directories), I don't think scanning for files should affect throughput much.
One problem with the tuning efforts is the inconsistency in read throughput, even when testing with console output. Anyway, you might consider increasing the spool size significantly, so you can buffer more events during read-throughput peaks. On the output side, set loadbalance: true and increase the number of workers (e.g. worker: 3). A batch of events in the spooler is split into N = spool size / bulk_max_size sub-batches, which are finally load-balanced onto the configured workers. That is, N should be a multiple of the number of workers. Only after all sub-batches have been processed will the next set of batches be forwarded to the outputs (lock-step load balancing). Filebeat 6.0 has better support for async publishing and load balancing.
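A rough sketch of those settings (the values are illustrative starting points, not tuned recommendations; spool_size is the Filebeat 5.x name, 6.0 replaces it with the in-memory queue settings):

```yaml
filebeat.spool_size: 12288          # larger spooler to absorb read-throughput peaks

output.logstash:
  hosts: ["logstash.example:5044"]  # placeholder address
  loadbalance: true
  worker: 3                         # three load-balanced connections
  bulk_max_size: 2048               # 12288 / 2048 = 6 sub-batches, a multiple of 3 workers
```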