Hello, I have built an ELK solution for processing/storing/visualizing syslog data from a number of highly utilized Cisco ASA firewalls. My Elasticsearch cluster setup is: six data nodes, one master node, and one client node. On a separate machine (8 cores, 16 GB RAM) I have three Logstash instances gathering data from three different UDP syslog streams, processing it, and loading it into three different Elasticsearch indices through the client node.
Each Logstash instance receives between 7k and 12k syslog events per second, and the setup generally seems to work: data is loaded into the datastore and I can use Kibana dashboards to manipulate and visualize event data. My problem is that Logstash is apparently losing some portion of the syslog events. Test events generated by hand never appear (or appear only partially) in the datastore and cannot be found through the dashboard.
So my question is: how do I troubleshoot this problem, and how do I prevent Logstash from losing syslog data? I have tried increasing the pipeline worker count and batch size (see how I start the instances below), but that doesn't seem to help.
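For reference, this is roughly how I start each instance; the config path is a placeholder and the flags assume a Logstash 2.x-style command line:

```
# one of the three instances; -w sets pipeline workers, -b the batch size
bin/logstash -f /etc/logstash/conf.d/asa1.conf -w 8 -b 250
```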
Thanks,
Wojtek
Logstash doesn't have any internal buffering (to speak of), so it could very well be the ingestion rate of ES that's limiting you. What's the bottleneck in your setup? What kind of performance do you get if you just blackhole the messages in Logstash (with a drop filter, for example, as sketched below)?
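A minimal blackhole pipeline might look like this; the port is a placeholder, and the metrics filter is just one way to watch the intake rate without involving ES at all:

```
input {
  udp { port => 5514 }        # placeholder: use the port your ASA stream hits
}
filter {
  metrics {
    meter   => "events"       # count everything flowing through the pipeline
    add_tag => "metric"       # the periodically flushed metric events get this tag
  }
  if "metric" not in [tags] {
    drop { }                  # blackhole the real syslog events
  }
}
output {
  # only the metric events survive the drop; print the observed 1-minute rate
  stdout { codec => line { format => "1m rate: %{[events][rate_1m]}" } }
}
```

If the rate printed here matches what the firewalls send, Logstash's intake is fine and the loss is happening further downstream.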
Hi Magnus, thanks for your answer. How would I measure the performance of such a configuration? As I understand it, once I blackhole events in Logstash they'll never reach the datastore?
Write a small program that sends UDP datagrams at a high rate and measure on the Logstash side how many get through. You may want to write them to a file in order to do the measurement. If you put a sequence number in each message, you can check afterwards how many messages were dropped, and whether you lose a message here and there or whole consecutive batches. Something along the lines of the sketch below.
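A minimal sketch of that idea in Python; host, port, counts, and the message format are all assumptions to adapt. Pair it with a file output in Logstash, e.g. `output { file { path => "/tmp/udp-test.log" } }`, then run the gap check against that file:

```python
#!/usr/bin/env python3
"""UDP load tester for a Logstash input: send sequence-numbered datagrams,
then check the file Logstash wrote for gaps in the sequence."""
import re
import socket
import sys
import time

HOST, PORT = "127.0.0.1", 5514   # assumed address of the Logstash udp input
TOTAL = 100_000                  # datagrams to send per test run
BURST = 1_000                    # pause briefly after each burst of this size


def send():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for seq in range(TOTAL):
        # each message carries a monotonically increasing sequence number
        sock.sendto(f"LOADTEST seq={seq}".encode(), (HOST, PORT))
        if seq % BURST == BURST - 1:
            time.sleep(0.01)     # crude pacing; tune to your target event rate
    sock.close()


def check(path):
    seen = set()
    pat = re.compile(r"seq=(\d+)")
    with open(path) as f:
        for line in f:
            m = pat.search(line)
            if m:
                seen.add(int(m.group(1)))
    missing = sorted(set(range(TOTAL)) - seen)
    print(f"received {len(seen)}/{TOTAL}, missing {len(missing)}")
    # long runs of consecutive missing numbers point at dropped batches
    # rather than the occasional lost datagram
    if missing:
        print("first gaps:", missing[:20])


if __name__ == "__main__":
    if len(sys.argv) > 1:
        check(sys.argv[1])       # usage: udp_test.py /tmp/udp-test.log
    else:
        send()                   # usage: udp_test.py
```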
There's probably tuning to be done in the kernel too, starting with the UDP receive buffers; see below.
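For instance, larger kernel receive buffers give Logstash more slack during bursts; the values here are just illustrative starting points:

```
# raise the maximum and default socket receive buffer sizes (16 MB here)
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.rmem_default=16777216

# watch UDP receive errors while load testing; a rising count means the
# kernel is dropping datagrams before Logstash can read them
netstat -su | grep -i error
```

The udp input's own workers and queue_size options may also be worth a look, if your Logstash version supports them.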