We have a load balancer that sends events as a JSON stream over TCP to Logstash, which is configured with a TCP input and a persistent queue.
The sender limits itself to 4 long-lived TCP connections to Logstash, but I believe the amount of data we need to send is overwhelming those 4 connections.
Around 50% of total events are being dropped by the sender during peak load balancer load, and fewer during low load (I've confirmed this).
I performed a packet capture on the sender and noticed very regular TCP Zero Window Size packets from Logstash.
This suggests Logstash is not able to flush its buffers fast enough.
Originally, I thought increasing pipeline workers or batch size could help, but after further research I found that the persistent queue lives between the input and the filters, and that workers and batch size relate to pulling data from the queue rather than from the input buffer.
Is this correct?
I've confirmed the PQ is not filling up, though it has around 10,000 events (which is about 0.001% of our max PQ size of 1TB).
Logstash has 8 CPUs, 8 pipeline workers, default batch size of 125, 4GB memory.
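For reference, those numbers correspond roughly to the following settings, assuming they are set globally in logstash.yml rather than per pipeline (the heap is configured separately in jvm.options):

```yaml
# logstash.yml -- settings as described above (a sketch, not a recommendation)
pipeline.workers: 8        # defaults to the number of CPUs
pipeline.batch.size: 125   # default value
queue.type: persisted      # persistent queue between inputs and filters/outputs
queue.max_bytes: 1tb       # our stated max PQ size
# The 4GB heap lives in jvm.options, e.g. -Xms4g and -Xmx4g
```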
Expected throughput is ~5k events/s, but the sender is currently only able to send ~1.3k events/s.
Could it be Disk IO speeds that are slowing the writes from the buffers to disk?
Would increasing workers or batch size help here?
Are there any RHEL OS network-related settings we could adjust to help?
Yes, the queue, whether in-memory or persistent, sits between the input and the filter/output blocks.
Pipeline settings like batch size and workers are applied to filters and outputs, not inputs.
What is the average event size? Have you tried using memory queues instead of persistent queues? Personally I avoid persistent queues; if I need some kind of buffer I use Kafka. It makes the infrastructure a little more complicated, but adds way more flexibility and performance.
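If you want to try that, switching the queue type is a one-line change (a sketch; the same setting can also be applied per pipeline in pipelines.yml):

```yaml
# logstash.yml -- test run with the in-memory queue instead of the PQ
queue.type: memory
# revert to "persisted" to go back to the persistent queue
```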
What is the disk type? HDD? One issue is that the OS TCP buffers may be filling up faster than Logstash can write to the persistent queue, and the persistent queue does not protect against data loss for TCP inputs.
In this case the speed of the disk may be a problem. You may try to tune the TCP settings of the OS, but this is out of the scope of this forum as it is not related to Logstash; here is an example of tuning that may or may not work for you.
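A sketch of the kind of OS-level TCP tuning meant here, applied on the Logstash host so the kernel has more receive-buffer headroom before it advertises a zero window. The values are examples only, not recommendations, and need testing in your environment:

```
# /etc/sysctl.d/90-logstash-tcp.conf (example values, adjust and test for your load)
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 262144 16777216
net.ipv4.tcp_wmem = 4096 262144 16777216
net.core.netdev_max_backlog = 250000

# apply with: sudo sysctl --system
```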
I don't think so; they influence only the filter and output stages, and it doesn't seem that you have any issue related to filters or outputs. But what is your output? Also, do you have any errors or warnings in the Logstash logs? And which version are you using?
As mentioned, see if you can do any tuning of the operating system's network settings.
We are not using memory queues as we needed the ability to recover unprocessed logs during abnormal Logstash termination. I was not aware that the PQ does not protect against data loss for TCP inputs, though I would hope it provides at least some protection over an in-memory queue?
My pipeline has both TCP and HTTP inputs, so it would at least protect against loss for the HTTP input.
The log source is streaming newline-terminated JSON over a raw TCP connection, and not something we can get into Kafka directly (unless we create a different pipeline just to output into Kafka, and then my existing pipeline inputs from Kafka).
We're using a network flash drive presented over iSCSI.
Output is to another Logstash (an aggregation Logstash) before it is output to Elastic (no filters).
No errors or warnings in the Logstash plain log.
Logstash 8.13.4
I'll try:
an in-memory queue, at least temporarily, to see if this helps.
checking iostat for high disk utilisation or waits.
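For the iostat check, something along these lines (from the sysstat package) should show whether the iSCSI-backed volume holding the PQ is the bottleneck; column names vary slightly between sysstat versions:

```
# extended stats every 5 seconds, skipping idle devices
iostat -xz 5
# watch %util, r_await/w_await and the queue size (aqu-sz) for the PQ device
```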
If the pipeline does not read data fast enough then the PQ will fill; at that point the tcp input stops reading data. The TCP stack will continue to buffer data until its buffers fill, at which point it advertises a zero receive window (exactly what you say you are seeing). At that point the source has to stop sending data. You will not lose anything, it will just be delayed.
In our case, the PQ is not filling up. Monitoring shows <1% full, but we still see zero windows to the source.
Unfortunately, our particular source will not queue events and will drop them instead. The source (a load balancer) is sending real-time metadata about the HTTP traffic going through it (things like timestamp, duration, HTTP request headers, response headers, and other metrics).
Not for all inputs; it is listed in the limitations, but can easily be missed.
Input plugins that do not use a request-response protocol cannot be protected from data loss. Tcp, udp, zeromq push+pull, and many other inputs do not have a mechanism to acknowledge receipt to the sender.
You have tcp and http inputs on the same pipeline? Is the data the same?
Personally I like to have different inputs run on different pipelines; it makes it easier to manage and tune each specific pipeline, as different inputs can have different requirements (see the sketch below).
But this is not a big problem, more a personal choice.
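For example, something like this in pipelines.yml (pipeline ids and config paths are made up) lets each input be tuned and monitored separately:

```yaml
# pipelines.yml -- one pipeline per input (sketch, hypothetical ids and paths)
- pipeline.id: lb-tcp
  path.config: "/etc/logstash/conf.d/lb-tcp.conf"
  pipeline.workers: 8
  queue.type: persisted
- pipeline.id: http-in
  path.config: "/etc/logstash/conf.d/http-in.conf"
  queue.type: persisted
```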
Yeah, this is pretty common to have; it is what I do.
I receive logs from multiple network devices that cannot write directly into Kafka, so they send using tcp/udp to a Logstash pipeline that acts as a Kafka producer, and then I have another Logstash pipeline acting as a Kafka consumer. Sometimes these pipelines are on the same instance, other times they are on different instances to have more isolation; it depends on the requirements.
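As a sketch of that pattern (broker address, topic name and ports are placeholders), the producer and consumer pipelines can be as small as:

```
# kafka-producer.conf -- receives the raw tcp stream and writes it to Kafka
input {
  tcp {
    port  => 5044
    codec => json_lines   # newline-terminated JSON from the load balancer
  }
}
output {
  kafka {
    bootstrap_servers => "kafka01:9092"
    topic_id          => "lb-events"
    codec             => json
  }
}

# kafka-consumer.conf -- reads from Kafka and forwards to the aggregation Logstash
input {
  kafka {
    bootstrap_servers => "kafka01:9092"
    topics            => ["lb-events"]
    group_id          => "logstash-consumer"
    codec             => json
  }
}
output {
  tcp {
    host  => "aggregation-logstash.example"
    port  => 5045
    codec => json_lines
  }
}
```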