I've updated logstash.yml to use persistent queues for a bit of extra redundancy, and have run into a problem where individual nodes will just stop processing messages after
a random amount of time, usually 30 minutes or more.
Kafka also shows the partition they're reading from as stopped.
The nodes themselves are still doing something, as they are still producing metrics, so I suspect it's related to the Kafka input somehow.
Once a node stops, the remaining nodes pick up the messages from its Kafka partition after a delay of about 5 minutes. However, if left long enough, all the nodes eventually stop processing.
I really have no idea where to start looking as there is nothing of note in logstash-plain.log.
logstash.yml changes are:
queue.type: persisted
queue.page_capacity: 10mb
queue.max_bytes: 10mb
queue.checkpoint.writes: 256
If I disable persistent queues, Logstash goes back to behaving itself. Enable queues again, and I get more stoppages.
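One thing that might help narrow this down is checking how full the persistent queue is on a node at the moment it stalls. Below is a minimal sketch, not something I have run against this cluster. It assumes the Logstash monitoring API is reachable on its default port 9600, and that the node stats response exposes `queue.capacity.queue_size_in_bytes` and `queue.capacity.max_queue_size_in_bytes` for each pipeline. Those field names can differ between Logstash versions, so treat them as assumptions. The helper names `fetch_stats` and `queue_fill` are made up for illustration.

```python
import json
import urllib.request


def fetch_stats(url="http://localhost:9600/_node/stats/pipelines"):
    """Fetch pipeline stats from the Logstash monitoring API (URL is an assumption)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def queue_fill(stats):
    """Return {pipeline_name: fraction_full} for pipelines using a persisted queue.

    The value is None when the expected size fields are missing, which may
    happen on Logstash versions that lay out the stats differently.
    """
    result = {}
    for name, pipeline in stats.get("pipelines", {}).items():
        queue = pipeline.get("queue") or {}
        if queue.get("type") != "persisted":
            continue  # memory queues have no on-disk size to report
        capacity = queue.get("capacity") or {}
        used = capacity.get("queue_size_in_bytes")
        limit = capacity.get("max_queue_size_in_bytes")
        if used is None or not limit:
            result[name] = None
        else:
            result[name] = used / limit
    return result
```

Calling `queue_fill(fetch_stats())` on a stalled node and on a healthy one would show whether the stalled node's queue is sitting at or near its `queue.max_bytes` limit.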
Any ideas what could be causing this?