GC for persistent queue and LS clustering

I'm working on switching our log aggregation from Graylog to a full ELK setup. While reading up on LS clustering, I found the recommendation to add queue.checkpoint.writes: 1 to the config to increase durability: https://www.elastic.co/guide/en/logstash/current/deploying-and-scaling.html
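
For context, this is roughly the logstash.yml block in question; a minimal sketch, where queue.max_bytes is just an assumed value:

```yaml
# logstash.yml -- persistent queue with the durability recommendation applied
queue.type: persisted        # use the disk-backed queue instead of the default in-memory queue
queue.max_bytes: 4gb         # assumed cap on disk space for the queue (default is 1024mb)
queue.checkpoint.writes: 1   # checkpoint after every written event, as the guide recommends for durability
```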

Unfortunately, with this setting on, the performance drop is unacceptable: throughput falls from an average of 15-4K events per second to 500, with gaps between sending events to ES:

[screenshot: events per second without queue.checkpoint.writes: 1]

[screenshot: events per second with queue.checkpoint.writes: 1]

Meanwhile, the queue size stays pretty minimal:

[screenshot: queue size]

My guess is that the culprit is I/O, since we're running LS in the cloud on a regular EBS-backed root volume.
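
If I understand it right, queue.checkpoint.writes: 1 forces a checkpoint (an fsync) after every written event, so a possible middle ground would be relaxing the checkpoint settings rather than dropping the persistent queue. A rough sketch, where the values below are the documented defaults rather than anything I've benchmarked:

```yaml
# logstash.yml -- relaxed checkpointing sketch (values are the documented defaults, not tuned numbers)
queue.type: persisted
queue.checkpoint.writes: 1024   # checkpoint after 1024 written events instead of after every event
queue.checkpoint.acks: 1024     # checkpoint after 1024 acknowledged events
```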

So my first question is: how important is it to have persistent queues with queue.checkpoint.writes set to 1 when we want to run multiple instances for each pipeline?

The other question I'm struggling to find an answer for is GC behavior when persistent queues are enabled. The same workers/batch.size settings (sketched below) show very different GC performance without (left side) and with (right side) the persistent queue:

[screenshots: JVM heap usage without (left) and with (right) the persistent queue]

What is causing such spikes in GC with persistent queues? Is it something that needs to be addressed with other settings/tuning, or is it expected in this case?
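
For context, both runs use the same pipeline settings, roughly along these lines (the numbers here are illustrative assumptions, not the exact production values):

```yaml
# logstash.yml -- pipeline settings kept identical across both runs (numbers are illustrative)
pipeline.workers: 4        # assumed worker count
pipeline.batch.size: 250   # assumed batch size
pipeline.batch.delay: 50   # default batch delay in milliseconds
```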

I'd appreciate any feedback. Thank you!

It doesn't look like you are measuring spikes in GC; you are measuring spikes in memory usage. Those are a really good thing! My guess is that on the left you are looking at more and more unpersisted events being stored on the heap, whilst on the right you are seeing them being stuffed into the queue so that Logstash can get on and do something. Rapid sawtooth patterns in memory usage are usually really healthy.
