Logstash ingestion slows sharply after 1 hour

I have Logstash 6.7 on Kubernetes which ingests .gz log files via the s3 input at a rate of about 20k/minute for roughly 1 hour, and then it suddenly drops to 1400/minute and stays there for hours, or until I kill the pod. When the pod restarts, it goes back to 20k/minute for another hour and then drops to 1400/minute again. This is measured from the Kibana side. I subsequently took Elasticsearch out of the equation by outputting to null {} and enabling the metrics plugin, which shows rates of 350/second (21k/minute).

My cache, heap, CPU, and pod memory never come close to their limits; everything runs at around 10% of capacity. I have 10 GB of persistent storage, with 8 GB allocated to the queue, but queue usage never exceeds 180 MB. I've used jstat and various API calls to look at Logstash's performance and nothing jumps out. I have tried dozens of configuration changes to CPU, memory, heap, pipeline workers, and batch size. What could explain the sudden drop in performance, and the return to normal after a restart?
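
For reference, a minimal sketch of the isolation test described above (bucket, region, and prefix are placeholders; the metrics output follows the standard metrics filter example):

```
input {
  s3 {
    bucket => "my-log-bucket"   # placeholder
    region => "us-east-1"       # placeholder
    prefix => "logs/"           # placeholder
  }
}

filter {
  metrics {
    meter   => "events"
    add_tag => "metric"
  }
}

output {
  # Print the 1-minute rate emitted by the metrics filter, drop everything else.
  if "metric" in [tags] {
    stdout { codec => line { format => "1m rate: %{[events][rate_1m]}" } }
  } else {
    null {}
  }
}
```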

I cannot explain that, but if I faced that problem I would be looking at the pipeline stats API. Capture a pair of stats snapshots during the first hour while it is fast, and another pair after it slows down, and see whether the relative cost of each filter changes.
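
For example, something like this (assuming the default API port 9600 and a pipeline named main; jq is only there for readability):

```
# Snapshot the pipeline stats; the counters are cumulative since startup,
# so take two snapshots and diff them to get the cost over a window.
curl -s 'localhost:9600/_node/stats/pipelines?pretty' > stats_fast.json

# Per-filter event counts and processing time from a snapshot:
jq '.pipelines.main.plugins.filters[] | {name, id, events}' stats_fast.json
```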

If you have a filter that caches data in memory, it could be that after an hour the cache fills and starts thrashing. But that is just one of many possibilities.

Thanks. I had looked at the pipeline stats previously, but I think part of the problem is that the stats from the first hour may be an average, so when I pull them again after the slowdown they might still be skewed because they haven't averaged down yet. I will try again to capture the "slow" stats after letting it run slow for several hours; maybe that will show a difference.
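
Something like this to grab a pair of "slow" snapshots to diff (the counters are cumulative, so the difference between the two reflects only that window):

```
# Two snapshots taken well after the slowdown; subtracting the per-filter
# duration_in_millis in the first from the second shows where time went
# during just that five-minute window.
curl -s localhost:9600/_node/stats/pipelines > slow_t0.json
sleep 300
curl -s localhost:9600/_node/stats/pipelines > slow_t1.json
```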

The solution was to reduce how often checkpoints are written for the persistent queue. The checkpoint-writes setting had been set to 1; raising it to 1024 has increased my throughput from 25k/minute to over 700k/minute ingesting from S3, without any further tuning, and it sustains that rate without the sudden performance drop after 1 hour.
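
For anyone else who hits this, the setting in question is presumably queue.checkpoint.writes in logstash.yml, which controls how many events are written to the persistent queue before a checkpoint (and its fsync) is forced. Roughly, with queue.type and queue.max_bytes reflecting the setup described earlier in the thread:

```
# logstash.yml -- queue settings as described in this thread
queue.type: persisted
queue.max_bytes: 8gb             # the 8 GB allocated to the queue above
queue.checkpoint.writes: 1024    # had been 1, i.e. a checkpoint after every event
```

With the value at 1, every single event forces a checkpoint and the associated disk sync; 1024 is the default.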
