If I understand correctly, Logstash has two thread pools: one for input (IN) and one for processing and output combined (OUT). When Logstash consumes input from Kafka, at what point does it send an ACK (offset commit) to Kafka? If the ACK is sent when data moves from the IN buffer to the OUT buffer, then there is a chance of data loss: if the process is restarted while there is data in the OUT buffer that has not yet been sent to Elasticsearch, that data is gone. However, if the ACK is sent only after the data has been sent to Elasticsearch, then a process restart will always resume where it left off.
I believe the input acknowledges receipt of data from Kafka as soon as it receives it, not after the outputs have processed it. You can use persistent queues to avoid data loss.
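For reference, here is a minimal sketch of what enabling the persistent queue in `logstash.yml` might look like. The specific values for queue size and checkpoint interval are illustrative assumptions, not recommendations:

```yaml
# logstash.yml -- enable the disk-backed persistent queue so events
# received from Kafka survive a process restart
queue.type: persisted

# Illustrative cap on disk usage for the queue; once it fills up,
# Logstash applies back-pressure to the Kafka input
queue.max_bytes: 4gb

# Illustrative durability/throughput trade-off: force a checkpoint
# after this many written events (lower = safer, slower)
queue.checkpoint.writes: 1024
```

With the queue on disk, the Kafka input can still commit offsets on receipt, but any event that has been written to the queue will be replayed to the outputs after a restart instead of being lost.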
Using persistent queues to avoid data loss is costly because of the associated storage: you pay both in time (extra disk I/O on every event) and in money (the disk itself).
There is an open issue about making Logstash capable of running in a stateless mode, where the input is not acknowledged until the data has been written successfully to the outputs. This would remove the need for an internal persistent queue, but it does not appear to be actively worked on. So for now a persistent queue is your best bet if you want to avoid data loss.
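To make the trade-off concrete, this is roughly what that end-to-end acknowledgment looks like when implemented directly against the Kafka consumer API. This is an illustrative sketch, not Logstash code, and the `indexToElasticsearch` helper is a hypothetical stand-in for whatever the output stage does:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class EndToEndAckConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "logstash-like-consumer");
        // Key point: disable auto-commit so offsets only advance
        // after the data has safely reached the final output
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("logs"));
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Hypothetical output stage; if this throws, the offset
                    // is never committed and the batch will be re-consumed
                    indexToElasticsearch(record.value());
                }
                // Only now tell Kafka the batch is done. A crash before this
                // line means re-delivery (at-least-once), never data loss.
                consumer.commitSync();
            }
        }
    }

    // Stand-in for the Elasticsearch output; details are out of scope here
    private static void indexToElasticsearch(String event) {
        // e.g. an HTTP bulk request to the cluster
    }
}
```

The cost of this approach is duplicate delivery after a crash (at-least-once semantics), so even if the stateless mode lands, it would not be a free lunch compared to the persistent queue.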