Logstash output load balancing and workers settings

(Bruno Lavoie) #1


I have a filebeat agent that will be sending quite a lot of access log data, and our current setup has 4 receiving logstash hosts. I would like to use all of them to get maximum throughput.

When I read the «workers» setting description here, I'm a little puzzled:

The number of workers «per configured host» publishing events to Logstash. This is best used with load balancing mode enabled. Example: If you have 2 hosts and 3 workers, in total 6 workers are started (3 for each host).

The first part says: the number of workers per configured host publishing events to Logstash.
Which hosts does «per configured host» refer to?

  • The host running the agent?
  • The hosts in the provided destination list?

Sorry if this is already stated clearly somewhere, but English is not my native language...

So, if I have 4 target hosts and want to benefit from this, should I specify 4 workers?

Also, I saw here that the logstash output's bulk_max_size should be sized in relation to the global spool_size?

To recap my understanding, to get the most out of load balancing I should have key performance settings like this:

  filebeat:
    spool_size: 2048        # Default 2048
  output:
    logstash:
      hosts: ["logs1.domain.com", "logs2.domain.com", "logs3.domain.com", "logs4.domain.com"]
      loadbalance: true
      worker: 4
      bulk_max_size: 512    # Default 2048

Many thanks,
Bruno Lavoie

(Steffen Siering) #2


The best advice is: measure, measure, measure. This post contains a Python script plus instructions on how to get throughput info right from filebeat. For testing, have a log file prepared (I often use the NASA HTTP logs) and delete the registry file between runs. Optionally, use the null output plugin in logstash (or stdout with the dots codec and the pv tool) so logstash doesn't generate any back-pressure.
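If you don't want to use the linked script, a rough equivalent is easy to sketch: recent filebeat builds can expose their internal expvar counters over HTTP when started with `-httpprof :6060`, and you can poll `/debug/vars` and compute an events-per-second rate. The counter name below is an assumption and varies by version, so check your own `/debug/vars` output first:

```python
import json
import time
import urllib.request

def counter_rate(url, name, interval=5.0, fetch=None):
    """Poll an expvar endpoint twice and return the per-second rate of `name`.

    `fetch` defaults to reading `url`; it is injectable for testing.
    """
    if fetch is None:
        fetch = lambda: json.load(urllib.request.urlopen(url))
    before = fetch().get(name, 0)
    time.sleep(interval)
    after = fetch().get(name, 0)
    return (after - before) / interval

# Example (assumed counter name -- inspect /debug/vars for your version):
# rate = counter_rate("http://localhost:6060/debug/vars",
#                     "libbeat.logstash.published_and_acked_events")
# print("%.0f events/sec" % rate)
```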

The worker: ... config is indeed per host. The default value is 1. That is, if H = # of hosts and W = # of workers, then H*W output workers will be spawned in total. Your sample config therefore spawns 16 workers pushing data in total.
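The multiplication can be sketched as a quick sanity check (host names taken from the sample config above):

```python
# Total output workers = (# of configured hosts) * (workers per host).
hosts = ["logs1.domain.com", "logs2.domain.com",
         "logs3.domain.com", "logs4.domain.com"]
worker = 4  # workers per configured host

total_workers = len(hosts) * worker
print(total_workers)  # 4 hosts * 4 workers = 16
```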

There are 2 options to try in your case:

  1. Set filebeat.publish_async: true. This pushes batches into the publisher pipeline as soon as they are ready. In this case I'd set spool_size somewhere in [bulk_max_size, bulk_max_size * worker * (# of hosts)]. If one logstash instance is not responding (or slowing down), filebeat continues publishing events using the other workers (thanks to async/pipelined publishing).

  2. Set filebeat.spool_size = output.logstash.bulk_max_size * output.logstash.worker * len(output.logstash.hosts). This splits each spooler batch into output.logstash.worker * len(output.logstash.hosts) sub-batches when publishing, so every host gets its share. The drawback is that if one logstash instance slows down, it first takes some timeout to detect that it is not responding and to retransmit the sub-batch via another logstash instance, basically blocking the output until all sub-batches have been ACKed.

My gut feeling tells me option 1 would give better throughput (despite gut feelings often being wrong), but in the end it's up to you to run experiments and measure your setup to find a configuration matching your requirements. And don't forget: the higher your throughput, the more resources filebeat and logstash will need to process your data.

(Bruno Lavoie) #3

Thanks a lot for this complete and clear answer...
Should the official docs be clearer on these settings?

(Bruno Lavoie) #4

did you mean publish_async ?

(Steffen Siering) #5

Thanks, I fixed my post. (Also fixed spool_size in option 1)
