Can the Logstash Kafka input plugin make any delivery guarantee?
Per Tips and Best Practices | Logstash Reference [7.6] | Elastic, we don't get an at-least-once guarantee:
"Does Kafka Input commit offsets only after the event has been safely persisted to the PQ?"
"Does Kafa Input commit offsets only for events that have passed the pipeline fully?"
No, we can’t make that guarantee. Offsets are committed to Kafka periodically. If writes to the PQ are slow or blocked, offsets for events that haven’t safely reached the PQ can be committed.
It sounds like the offset commit (with enable_auto_commit enabled) is not blocked by the PQ write. Therefore, you can have a data-loss scenario (input settings sketched after this list) when:
- Offsets are committed
- But records are not fully persisted to PQ yet
- Logstash crashes
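
For reference, this is the kind of Kafka input configuration in question; the broker address, topic, and group id are placeholders, and the interval shown is the plugin default:

```
input {
  kafka {
    bootstrap_servers       => "localhost:9092"   # placeholder broker
    topics                  => ["my-topic"]       # placeholder topic
    group_id                => "logstash"         # placeholder consumer group
    enable_auto_commit      => "true"             # offsets committed on a timer, independent of PQ writes
    auto_commit_interval_ms => "5000"             # default 5s interval; roughly the loss window described above
  }
}
```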
This contradicts the blog Just Enough Kafka For The Elastic Stack, Part 2 | Elastic Blog, and it implies we don't get an at-most-once guarantee either:
Kafka is designed to follow at-least-once semantics — messages are guaranteed to be not lost, but may be redelivered. This means there could be scenarios where Logstash crashes, while the offset is still in memory, and not committed. This can cause messages to be re-delivered, or in other words, duplicated.
You will get duplicate data (a deduplication sketch follows this list) when:
- Offsets are not committed
- But records are processed
- Logstash crashes
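
One common way to make at-least-once redelivery harmless downstream is to index idempotently, deriving the Elasticsearch document id from a fingerprint of the event. A minimal sketch, assuming the message field uniquely identifies an event and that the host and index below are placeholders:

```
filter {
  fingerprint {
    source => "message"                   # assumes "message" uniquely identifies the event
    target => "[@metadata][fingerprint]"
    method => "MURMUR3"
  }
}
output {
  elasticsearch {
    hosts       => ["localhost:9200"]                # placeholder
    index       => "kafka-events"                    # placeholder
    document_id => "%{[@metadata][fingerprint]}"     # redelivered events overwrite instead of duplicating
  }
}
```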
In summary:
- Is there any type of delivery guarantee that Logstash can make on consumption?
- Would it work differently when enable_auto_commit is disabled?
- Any general recommendations on reliable Kafka consumption? Such as:
  - Monitor the PQ closely to make sure PQ writes don't fall behind by more than auto_commit_interval_ms (example queue settings are sketched after this list).
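
For context, these are the logstash.yml persistent-queue settings I have in mind; the values are illustrative, not a recommendation:

```
# logstash.yml -- illustrative values only
queue.type: persisted        # use the persistent queue instead of the in-memory queue
queue.max_bytes: 4gb         # disk cap for the queue
queue.checkpoint.writes: 1   # checkpoint after every written event (durability over throughput)
```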