We are using Logstash v6.8.2 with the persistent queue enabled. Recently, one of our clusters started to show signs that the queue had data on disk, but the event count was 0. After observing the behavior for several days, it looks like some of the pages just never get removed. They persist and grow in number until eventually Logstash starts to show load stress, pinning all the cores. The only way to recover is to stop, wipe the queue, and start again; restarting without wiping does not make the events go away.
This is what the queue directory contents look like - you can see an older checkpoint and page file (153) alongside the current head page (248). The 153 files have not been accessed or updated:
Oct 24 00:21 checkpoint.153
Oct 24 01:12 checkpoint.head
Oct 24 00:21 page.153
Oct 24 01:12 page.248
Access: 2019-10-24 00:21:34.711122732 +0000
Modify: 2019-10-24 00:21:34.711122732 +0000
Change: 2019-10-24 00:21:34.712122735 +0000
I read this from another topic:
"Periodically the Page is fsynced. As a filter worker loops around it acknowledges the batch it was working with, discards it and reads a new batch. Along with the fsync of the Page file, a v small bookkeeping file we call a Checkpoint is atomically (over)written to disk. The Checkpoint holds info about where in the page the write and acked pointers are - so on a restart Logstash can replay any any events not acked."
As a workaround until we can upgrade, we are considering manually sweeping the files that have not been accessed recently (see the sketch below).
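A minimal sketch of what that sweep might look like - the queue path and seven-day cutoff are placeholders, and we'd only ever run it while Logstash is stopped:

```python
# Sketch of the manual sweep we are considering: remove page.N / checkpoint.N
# files that have not been modified in CUTOFF_DAYS, never touching
# checkpoint.head. Run only while Logstash is stopped; path and cutoff
# below are examples, not our real values.
import os
import re
import time

QUEUE_DIR = "/var/lib/logstash/queue/main"  # example path, adjust per install
CUTOFF_DAYS = 7
DRY_RUN = True

cutoff = time.time() - CUTOFF_DAYS * 86400
pattern = re.compile(r"^(page|checkpoint)\.(\d+)$")

for name in sorted(os.listdir(QUEUE_DIR)):
    if not pattern.match(name):
        continue  # skips checkpoint.head and anything unexpected
    path = os.path.join(QUEUE_DIR, name)
    if os.path.getmtime(path) < cutoff:
        print(("would remove" if DRY_RUN else "removing"), path)
        if not DRY_RUN:
            os.remove(path)
```

It deliberately skips checkpoint.head and defaults to a dry run so we can eyeball the list before deleting anything.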
But I'm really curious about the cause of this - especially since we have 11 clusters of Logstash + Elasticsearch, and this is the only one having this problem.
- Is there any utility to inspect the files to determine which (if any) of the events have not been acked?
- Will a specific logging level other than debug show when fsync is running?
- We have some events going to the dead letter queue - I assumed that events routed to the DLQ still get acked in the persistent queue... but maybe I am wrong
Thanks for reading
-m