I am trying to measure the performance impact of enabling persisted queues on Logstash 6.6.
To eliminate any influence of the Elasticsearch environment, all tests write their output to /dev/null and no additional filters are configured in Logstash:
output {
  file {
    path => "/dev/null"
  }
}
I have tried the following input configurations:
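For reference, the inputs were along these lines (the port, codec and generator message shown here are placeholders, not necessarily my exact settings):

# stdin input
input { stdin { } }

# TCP input (port is a placeholder)
input {
  tcp {
    port => 5044
    codec => line
  }
}

# generator input (message is a placeholder; count => 0 generates events indefinitely)
input {
  generator {
    message => "sample log line"
    count => 0
  }
}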
I have tested both memory and persisted queues for each of the mentioned input types (TCP, stdin and the generator plugin). The persisted queue is located on a tmpfs partition in RAM to eliminate any influence of HDD performance.
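For the persisted queue runs, the relevant queue settings in logstash.yml look roughly like this (the tmpfs mount point and size are placeholders):

queue.type: persisted
path.queue: /mnt/tmpfs/logstash-queue   # placeholder tmpfs mount point
queue.max_bytes: 4gb                    # placeholder; sized to fit within the tmpfs partition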
I use the monitoring API and Grafana to measure the number of events processed by Logstash: I poll :9600/_node/stats/pipelines/ and track the events "in" and "out" counters.
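Concretely, the measurement amounts to polling the node stats API, roughly:

curl -s 'http://localhost:9600/_node/stats/pipelines?pretty'
# throughput is derived from the difference between the "events" -> "in" / "out"
# counters of successive polls, divided by the poll interval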
I have used the same corpus of log lines for the stdin and TCP tests (sent via the same Python script), and the same lines for the generator plugin.
Empirically, the best performance in my environment (for each queue type) appears to be achieved with the following settings:
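These were tuned through the usual pipeline settings in logstash.yml; the values below are placeholders rather than the exact ones used:

pipeline.workers: 4        # placeholder
pipeline.batch.size: 130   # 130 is the value referenced in the batch size discussion below
pipeline.batch.delay: 50   # placeholder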
As can be seen from the results below, enabling the persisted queue significantly decreases pure Logstash performance:
stdin input + memory queue 104.6 K events/s max
generator input + memory queue 119.0 K events/s max
TCP input + memory queue 36.2 K events/s max
stdin input + persisted queue 35.9 K events/s max
generator input + persisted queue 48.4 K events/s max
TCP input + persisted queue 19.2 K events/s max
Please note that the log line corpus, pipeline worker and batch size configurations, and JVM settings remained the same between the "memory" and "persisted" queue tests of the same input.
Is such a performance impact expected for the persisted queue, or is something off in my tests? What is the recommended way of comparing memory and persisted queue performance?
@alexpanas, your testing methodology looks pretty good to me and the throughput ratio that you are seeing between memory queues and persisted queues is similar to the ratio we've seen in our internal testing. At its current state of development, the persistent queues feature does have a performance cost.
Are there any recommendations (besides queue.checkpoint.writes: 0 and queue.checkpoint.acks: 0) to improve persisted queue performance, or is the ratio I was able to get pretty much the best it can be?
Am I also understanding correctly that the throughput achievable with queue.checkpoint.writes: 1 and queue.checkpoint.acks: 1 can be used as a worst-case marker for persisted queue performance?
@alexpanas, those are the two main software settings that affect PQ performance and yes, setting them both to 1 is definitely the worst-case scenario. Beyond those two, we've found that increasing queue.page_capacity beyond the default 64MB is generally counterproductive. And finally, of course, hardware matters so placing the PQ files on a fast SSD will perform better than placing them on a slower HDD.
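For reference, a minimal logstash.yml sketch of the settings mentioned above:

# PQ checkpoint tuning discussed in this thread
queue.checkpoint.writes: 0   # 0 = unlimited writes between forced checkpoints (fastest, least durable)
queue.checkpoint.acks: 0     # 0 = unlimited acks between forced checkpoints
# queue.page_capacity: 64mb  # default; increasing it is generally counterproductive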
I have also noticed that pipeline.batch.size tends to behave differently for memory and persisted queues: increasing this parameter to 4000 improved the performance of the memory queue, but significantly decreased the performance of the persisted queue (the generator input was able to push only 39.4 K events/s max, versus 48.4 K with the value of 130 I ended up using in my tests).
Is this expected? Are there any guidelines on how to calculate the optimal value of this parameter without a trial-and-error process (which would require experimenting on the production environment, something I am not fond of)?
@alexpanas, it is certainly true, as you've found, that higher batch sizes tend to work better for memory queues. With smaller batch sizes, we see significant lock contention on the memory queues which reduces throughput.
It's harder to find optimal sizes with the PQ, though. Generally speaking, they are smaller, but the optimal size depends a lot on the distribution of input events and how quickly they can be processed on the output side. I would suggest experimenting in a QA or staging environment to find the optimal batch size there. I would then use that same batch size in production and if your throughput in production was similar to what you observed in your QA or staging environment, I would assume that you are very close to the optimal batch size in your production environment as well.