I'm working on switching our log aggregation from Graylog to a full ELK setup. While reading about LS clustering, I found a recommendation to add queue.checkpoint.writes: 1 to the config to increase durability: https://www.elastic.co/guide/en/logstash/current/deploying-and-scaling.html
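For reference, the relevant part of logstash.yml looks roughly like the sketch below. Only queue.checkpoint.writes: 1 comes from the recommendation above; queue.type: persisted is assumed here because the checkpoint setting only applies when the persistent queue is enabled, and everything else is left at its defaults.

```yaml
# logstash.yml (illustrative sketch, not a full config)
queue.type: persisted        # enable the persistent queue (the default is "memory")
queue.checkpoint.writes: 1   # force a checkpoint after every written event (the default is 1024)
```

With the default of 1024, checkpoints are forced far less often, which is the trade-off between durability and disk writes that this question is about.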
Unfortunately, with this setting on, the performance drop is unacceptable. Throughput falls from an average of 15-4K events per second to about 500, with gaps in sending events to ES:
Without
With
Meanwhile, the queue size stays pretty minimal:
My guess is that the culprit is I/O, since we're running LS in the cloud with a regular EBS-backed root volume.
So the first question is: how important is it to have persistent queues with queue.checkpoint.writes set to 1 when we want to run multiple instances for each pipeline?
The other question I'm struggling to find an answer to is GC behavior when persistent queues are enabled. The same workers/batch.size settings show very different GC performance without (left side) and with (right side) the persistent queue:
What is causing such spikes in GC with persistent queues? Is it something that needs to be addressed with other settings/tuning, or is it expected in this case?
I'd appreciate any feedback. Thank you!