I have finally built a proper pipeline for Logstash to pull data from Kafka and insert it into Elasticsearch.
However, I seem to be having an issue, or possibly a self-induced configuration problem:
I need to pull from 8 topics, and here is the general config:
kafka {
  auto_offset_reset => "smallest"
  reset_beginning => "true"
  consumer_id => "logf003"
  consumer_threads => "6"
  group_id => "kafka_DataMetrics"
  topic_id => "DataMetrics_SharedDB"
  type => "kafka-datametrics"
  zk_connect => "zookeeper.service.consul:2181"
}
So my issue is this: if I use auto_offset_reset => "smallest", won't my 3 Logstash forwarders always go back to offset #1 in a topic whenever they crash? Wouldn't that cause a lot of data to be resubmitted? Given that I have over 2 billion offsets to put into Elasticsearch, wouldn't a crash and restart really set my processing back?
If I use the defaults, my Logstash forwarders never pull all the data from the start of my Kafka topics.
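For clarity, by "the defaults" I mean leaving the two offset settings out entirely, which (if I am reading the plugin docs correctly) makes a consumer group with no stored offsets start at the newest messages rather than the oldest:

kafka {
  # no auto_offset_reset / reset_beginning here; my understanding is the
  # plugin default is "largest", i.e. start from the newest messages when
  # the consumer group has no offsets committed in ZooKeeper
  group_id => "kafka_DataMetrics"
  topic_id => "DataMetrics_SharedDB"
  zk_connect => "zookeeper.service.consul:2181"
  consumer_threads => "6"
  type => "kafka-datametrics"
}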
What should I use in this case so that I can do both of the following (my current best guess is sketched after the list)?
- Load all data up to current from my Kafka topics
- Ensure that, should my Logstash instances crash, they restart in an appropriate place and do not pull/send duplicate entries to Elasticsearch
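My guess, and I would love confirmation or correction, is that I should keep auto_offset_reset => "smallest" so the very first run starts at the beginning of each topic, but drop reset_beginning => "true" (which, as I understand it, clears the group's offsets stored in ZooKeeper on every start) and also drop the hard-coded consumer_id (I assume letting the plugin auto-generate it is safer when several instances share one group). Something like:

kafka {
  # same group_id on every Logstash instance, so Kafka splits the
  # partitions between them instead of each instance getting a full copy
  group_id => "kafka_DataMetrics"
  topic_id => "DataMetrics_SharedDB"
  zk_connect => "zookeeper.service.consul:2181"
  consumer_threads => "6"
  # "smallest" should only apply when the group has no committed offset,
  # so only the very first run starts at offset #1 ...
  auto_offset_reset => "smallest"
  # ... and without reset_beginning => "true" a restart should resume
  # from the offsets already committed to ZooKeeper
  type => "kafka-datametrics"
}

Is that right, or am I misunderstanding how the stored offsets interact with auto_offset_reset?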
I will also add that all my Logstash instances appear to be pulling the exact same data at the same time, because their message rates, as tracked by metrics, are all exactly the same.
Thanks for any advice.