I took a packet capture and saw some traffic on port 5044, but I'm not able to decode it visually. We're not using SSL, but it's not plain text either.
The traffic is compressed, which you can disable on the output by setting compression_level: 0 (see the Logstash output settings for Filebeat).
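For example, a minimal filebeat.yml sketch showing where that setting lives (the host below is a placeholder):

```yaml
# filebeat.yml -- compression_level: 0 sends events to Logstash uncompressed
# (the Filebeat default is 3); the host below is a placeholder
output.logstash:
  hosts: ["logstash.example.com:5044"]
  compression_level: 0
```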
Loading the capture into Wireshark should allow you to view the compressed body, or you can disable compression, which should make the body readable in tcpdump.
We turned off compression and it turns out that the traffic is making it across the wire to our Logstash servers. I guess my ability to understand tcpdump while under pressure isn't as good as I thought it was.
We still don't understand why some traffic is being processed by Logstash but not all of it. Certain log types are being processed for some servers but not others, and the servers that are having trouble getting a specific log file processed are not having issues with their other log files.
You can use the Logstash APIs to pull pipeline stats: Node Stats API | Logstash Reference [7.17] | Elastic
Each pipeline has stats for failed events, as well as events in and out of each plugin, which should help you figure out what might be happening with the events. You're welcome to post the node stats here and I can take a look, but it might be difficult for me to help without the pipeline definitions.
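To illustrate what to look for, here is an abridged sketch of the shape of that response; the API returns JSON, it's rendered as YAML here for readability, the numbers are made up, and the exact field names should be verified against your own output:

```yaml
# Abridged sketch of GET /_node/stats/pipelines (illustrative values only)
pipelines:
  main:
    events:
      in: 100000                  # events accepted by the pipeline
      out: 99500                  # events that left the pipeline
      duration_in_millis: 250000
    plugins:
      inputs:
        - name: beats
          events: { out: 100000 }
      filters:
        - name: grok
          events: { in: 100000, out: 99500, duration_in_millis: 180000 }
      outputs:
        - name: elasticsearch
          events: { in: 99500, out: 99500 }
```

Comparing the in and out counts of each plugin shows where events are being dropped or filtered.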
It's interesting: I have tried to tweak the batch size and the number of workers for my pipelines, but the API says they are at the defaults and doesn't report per-pipeline data.
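For reference, per-pipeline workers and batch size are configured in pipelines.yml and override the defaults from logstash.yml; a minimal sketch, with the pipeline ID and config path as placeholders:

```yaml
# pipelines.yml -- per-pipeline settings override the defaults in logstash.yml
- pipeline.id: main                             # placeholder pipeline ID
  path.config: "/etc/logstash/conf.d/*.conf"    # placeholder config path
  pipeline.workers: 32
  pipeline.batch.size: 500
```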
Have you identified, via the curl -XGET 'localhost:9600/_node/stats/pipelines?pretty' API, which pipeline or which pipeline steps appear to be dropping events?
I can see failures, though nothing specifically about drops. The particular section that I know should be handling the traffic we're not seeing all of does not have any failures. The entire pipeline has a difference of only about 5k between events in and events out on a 32-core machine, which I assume is trivial, especially since the 3 Logstash servers we're running in that DC are processing over 80M log lines per hour.
I would expect failures to be represented in the logs of the Logstash server. Are you able to see any errors in the Logstash logs?
You'll want to look at the in and out counts of each plugin in the pipeline to see which plugin is dropping or filtering the messages.
We figured out that in one of the DCs where we're having issues, a particular log had its verbosity turned up to 11, so the pipeline was getting clogged. That's one mystery solved.
We may be seeing a similar situation in our other problem DC, but I'm not sure how to determine what is causing the clog there.
I would recommend using the Logstash integration with Elastic Agent to monitor your Logstash clusters: Logstash | Documentation
You might also look in Elasticsearch at the volume of data coming from each DC to identify any other high-volume DCs.
You could also deploy something like the Network Packet Capture integration in Elastic Agent, or Packetbeat on the Logstash nodes (or other network logging), to better understand where your traffic is coming from.
The issue in our other DC seems to be a capacity one. We had 4 lightly loaded servers, so we decommissioned 1. The remaining 3 were still lightly loaded, with plenty of free RAM and CPU headroom, but for some reason the system couldn't keep up and logs were not being processed in a timely manner.
We have increased both pipeline.workers and pipeline.batch.size to the point where we were not seeing any additional performance gains, but without the 4th server we were unable to squeeze any more performance out of the system.
It might be worth looking at the pipeline steps being performed; I think the stats give you info on per-step throughput, which should show where the bottleneck is in the pipeline. You might be able to remove or optimize any heavy pipeline steps.
It may also be worth investigating whether the bottleneck is actually Logstash writing to Elasticsearch, which you may be able to optimize as well.
It might also make sense to see if you can tune the Beats maximum bulk size to match your Logstash batch size.
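As a sketch of that last suggestion, assuming Filebeat is the shipper (the host and values below are examples only): the Logstash output in filebeat.yml exposes bulk_max_size, which can be aligned with pipeline.batch.size on the Logstash side.

```yaml
# filebeat.yml -- example only; align the Beats bulk size with Logstash's batch size
output.logstash:
  hosts: ["logstash.example.com:5044"]   # placeholder host
  bulk_max_size: 500                     # match pipeline.batch.size in logstash.yml
```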
I don't know if I should start a new thread, but we're trying to squeeze more performance out of our Logstash servers. When we were having issues, the CPU load did not go up, and neither did the RAM usage.
The hardware we use has 16 or 32 cores and 64GB of RAM, with 24GB allocated for Java heap space. We have tuned pipeline.workers to 128 and pipeline.batch.size to 500, but we can't get the CPU load on any of them to go above 8. Adding an additional Logstash server to the pool resolved our ingest issue, so we don't believe the issue is write performance, but we don't understand where our bottleneck is.
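For reference, a sketch of where those settings live, using the values described above (the heap is configured separately in jvm.options; none of this is a tuning recommendation):

```yaml
# logstash.yml -- pipeline defaults as described in the post above
pipeline.workers: 128       # threads running the filter and output stages
pipeline.batch.size: 500    # events each worker batches before flushing
# The 24GB heap is set in jvm.options, e.g. -Xms24g and -Xmx24g
```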
Are you using persistent queues or any other settings that might affect performance?
Do you have a complicated pipeline? You can look at duration_in_millis in the pipeline stats to identify which steps are taking the most time/CPU.
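For context, a sketch of the persistent-queue settings mentioned above, with illustrative values only:

```yaml
# logstash.yml (or per pipeline in pipelines.yml) -- illustrative values only
queue.type: persisted     # default is "memory"; "persisted" buffers events on disk
queue.max_bytes: 4gb      # disk budget before the queue applies backpressure
```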