We recently upgraded to ES 5.5.1 and Logstash 5.5.1 from ES 2.4 and Logstash 2.4.
Our pipeline is pretty simple: a kafka input with multiple consumer threads subscribing to a pattern, a few filters, and an ES output.
In the process of upgrading, our pipeline seems to have broken in two ways (possibly related).
- After a while, Logstash dies with OOM errors. I tried increasing the heap from the default size to 4 GB, but still hit the same issue. We run 4 pipeline workers with a batch size of 2500, and no matter what I adjust we still end up with an OOM. Analyzing a heap dump, I found that most of the memory is consumed by Kafka ConsumerRecord instances: with a batch size of 250, we see approximately 2 million ConsumerRecords but only about 1000 Event objects. We also see at least two ConsumerRecords for every message in Kafka.
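For reference, the heap numbers above came from a standard JDK heap inspection; a sketch of the commands involved (the pid and output path are placeholders):

```shell
# Dump live objects from the Logstash JVM (pid 12345 is a placeholder).
jmap -dump:live,format=b,file=/tmp/logstash-heap.hprof 12345

# Or get a quick class histogram without a full dump; ConsumerRecord
# dominates the top of this list in our case.
jmap -histo:live 12345 | head -n 20
```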
Our Kafka input configuration is:
kafka {
  auto_offset_reset => "earliest"
  decorate_events => true
  topics_pattern => "${TOPIC_PREFIX}_([a-z]+)_logs"
  bootstrap_servers => "${KAFKA_BOOTSTRAP}"
  consumer_threads => 10
  codec => "json"
}
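If the client really is buffering far more records than the pipeline can drain, one mitigation I'm considering is bounding each poll explicitly. A sketch of what that might look like (whether this plugin release actually exposes `max_poll_records` and `max_partition_fetch_bytes` is an assumption on my part; the values are illustrative):

```
kafka {
  # ...settings as above...

  # Cap how many records a single poll() may return (assumed option)
  max_poll_records => "500"
  # Cap bytes fetched per partition per request, 1 MB here (assumed option)
  max_partition_fetch_bytes => "1048576"
}
```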
Given the pipeline batch size, the use of the bounded in-memory queue, and Kafka's default max fetch size, the OOM and the sheer number of ConsumerRecord instances don't make much sense. Based on the plugin source, the "poll()" loop should block until the queue frees up some space, so it looks like the Kafka client is leaking ConsumerRecord objects.
- Upgrading to Logstash 5.5.1 also upgraded the Kafka client to the latest version, which commits offsets to Kafka rather than ZooKeeper. However, for most of our consumers we see neither the lag decreasing nor the offsets increasing. If we change "auto_offset_reset" to "latest" and use a new "group_id", offset tracking does seem to work for topics with a small number of partitions. With "earliest", offsets appear not to be committed at all.
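For what it's worth, committed offsets can be checked directly on the broker with the consumer-groups tool; this is how I've been watching the lag (the bootstrap address and group name are placeholders for our actual values):

```shell
# Describe the consumer group's committed offsets and lag per partition.
# CURRENT-OFFSET stays blank/unchanged when commits are not happening.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group logstash
```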
Searching through GitHub issues, I found https://github.com/logstash-plugins/logstash-input-kafka/issues/140. It seems related, but so far I've had no luck getting the pipelines to work reliably.
I realize this may simply be an issue with the Kafka client itself, but it seems like something else is at play here.