I'm new to troubleshooting Logstash performance issues. I have a cluster of 3 load-balanced virtual servers running Logstash that forward on to Elasticsearch in the cloud. We've noticed that we're not getting all of our logs from Palo Alto Panorama into Elasticsearch. Panorama (CEF format) forwards to the load balancer, which distributes to the Logstash servers, which then ship to Elasticsearch in the cloud. We're seeing "syslog connection broken" messages in Panorama, and traffic captures on the Logstash servers show TCP Zero Windows, which tells me Logstash isn't pulling logs from the TCP buffer quickly enough. The pipeline currently has the defaults:
pipeline.workers: 1
pipeline.batch.size: 125
pipeline.batch.delay: 50
queue.type: memory
queue.max_bytes: 1 GB
queue.checkpoint.writes: 1024
Log volume is between 18 million and 26 million events a day, plus whatever we're missing.
Heap size in jvm.options per Logstash server is at 16 GB.
My first thought is to increase pipeline workers, but with that log volume I'm unsure how many. Another thought is to split the Palo Alto traffic logs, since they are the noisiest, off from the rest by forwarding them on a different port to a separate pipeline, roughly as sketched below. Or maybe I need to add a 4th Logstash server?
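For the split-pipeline idea, something like the pipelines.yml below is what I'm picturing (the pipeline IDs, config paths, and the new port are placeholders, not what we actually run today):

```yaml
# pipelines.yml - rough sketch of splitting the Palo Alto traffic logs onto
# their own pipeline (pipeline IDs, paths, and the port are placeholders)
- pipeline.id: palo-traffic
  path.config: "/etc/logstash/conf.d/palo-traffic.conf"  # its input would listen on a new dedicated port
- pipeline.id: palo-other
  path.config: "/etc/logstash/conf.d/palo-other.conf"    # keeps the existing port for the remaining log types
```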
Any suggestions or links to webinars on performance would be helpful.
Is there any reason to keep pipeline.workers set to 1? This can have a big impact on your ingestion performance.
How is the CPU usage on those machines? Do you have anything else running on them, or any other heavy Logstash pipeline?
Something around 30 million events per day is not that much, especially when you are load balancing across 3 servers.
I would use the default number of pipeline workers for your Palo Alto pipeline; in your case the default would be 10, since that is the number of CPUs in the machine.
You could also increase the pipeline batch size to something like 500.
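As a rough sketch (the pipeline ID and config path are placeholders for however your pipelines.yml is laid out), the per-pipeline overrides would look something like:

```yaml
# pipelines.yml - per-pipeline overrides for the heavy Palo Alto pipeline
# (pipeline.id and path.config are placeholders)
- pipeline.id: palo-alto
  path.config: "/etc/logstash/conf.d/palo-alto.conf"
  pipeline.workers: 10       # default is one worker per CPU, so 10 on your machines
  pipeline.batch.size: 500   # up from the default of 125
```

You can also set these globally in logstash.yml, but per-pipeline overrides let you tune only the heavy pipelines and leave the small ones at the defaults.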
Our Elastic Co consultant had us leave them all alone when we were initially setting things up.
We have Infoblox logs coming into the same servers, roughly 130 million per day, with the same default pipeline settings, and we're seeing TCP Zero Windows for that feed as well. That's the only other heavy hitter going through those Logstash servers. There are 5 other log sources in addition, but each with significantly less volume. Each data source has its own pipeline.
I would change both the pipeline workers and the batch size and see if it improves.
The number of pipeline workers determines how many workers will process your events in parallel through the filter and output blocks. With 1 worker you have only one worker processing your events, so you are basically processing one batch after the other without taking advantage of the 10 CPUs your server has.
Just as an example, I have an ingestion process that uses 2 Logstash servers, each with 12 vCPUs, 16 GB of RAM, and 5 GB for the Logstash heap; each of them is able to process around 400 million messages from network devices per day without saturating the CPU.
If you want to be cautious, increase the workers to half the number of CPUs and leave the batch size at the default, then keep testing higher values until you see your servers start to saturate.
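A cautious version of that, again with placeholder names, would be something like:

```yaml
# pipelines.yml - conservative starting point: half of the 10 CPUs,
# batch size left at the default of 125 (id and path are placeholders)
- pipeline.id: palo-alto
  path.config: "/etc/logstash/conf.d/palo-alto.conf"
  pipeline.workers: 5
```

While you test, you can watch CPU usage and per-pipeline event throughput through the Logstash monitoring API (the _node/stats/pipelines endpoint on port 9600) to see when the servers start to saturate.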