Kafka input not committing offsets and causing OOM

We recently upgraded to ES 5.5.1 and Logstash 5.5.1 from ES 2.4 and Logstash 2.4.

Our pipeline is pretty simple: a kafka input with multiple consumer threads subscribing to a pattern, a few filters, and an ES output.
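At a high level the config has roughly this shape (the filter and output bodies below are placeholders, not our real configuration):

  input {
    kafka {
      # actual settings shown below
    }
  }

  filter {
    # a few lightweight filters (placeholder)
  }

  output {
    elasticsearch {
      hosts => ["${ES_HOSTS}"]   # placeholder env var
    }
  }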

In the process of upgrading, our pipeline seems to have broken in two ways (possibly related).

  1. After a while, Logstash dies with an OutOfMemoryError. I tried raising the heap from the default size to 4 GB, but still hit the same issue. We have 4 pipeline workers with a batch size of 2500, and no matter what I adjust we still end up with an OOM. Analyzing the heap dump, most of the memory is consumed by Kafka ConsumerRecord instances: with a batch size of 250 we see approximately 2 million ConsumerRecords but only 1000 Event objects, and at least two ConsumerRecord instances for every message in Kafka.
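For reference, these are the settings described above in Logstash 5.x terms (values are what we were running; -Xms is simply set equal to -Xmx here):

  # logstash.yml
  pipeline.workers: 4
  pipeline.batch.size: 2500

  # jvm.options
  -Xms4g
  -Xmx4g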

Our Kafka configuration is:

  kafka {
    auto_offset_reset => "earliest"
    decorate_events => true
    topics_pattern => "${TOPIC_PREFIX}_([a-z]+)_logs"
    bootstrap_servers => "${KAFKA_BOOTSTRAP}"
    consumer_threads => 10
    codec => "json"
  }

Given the size of the pipeline, the use of the in-memory queue, and Kafka's default max fetch size, neither the OOM nor the number of ConsumerRecord instances makes much sense. Based on the plugin source, the poll() call should block until the queue frees up some space, so it looks as though the Kafka client is holding on to (or leaking) ConsumerRecord objects.
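If this were purely a buffering problem rather than a leak, the input exposes options that bound how much a single poll can pull in. A minimal sketch, assuming the plugin version shipped with Logstash 5.5 supports these options (values illustrative only):

  kafka {
    # ...same settings as above, plus:
    max_poll_records => "500"               # cap on records returned by a single poll
    max_partition_fetch_bytes => "1048576"  # cap on bytes fetched per partition per request
  }

Roughly, the in-flight footprint scales with consumer_threads x records-per-poll x average record size, so with 10 consumer threads even modest per-poll limits add up quickly.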

  2. With the upgrade to Logstash 5.5.1 we also ended up on the latest Kafka client, which commits offsets to Kafka rather than to ZooKeeper. However, for most of our consumers we are not seeing the lag decrease or the committed offsets advance. If we change "auto_offset_reset" to "latest" and switch to a new "group_id", offset tracking does seem to work for topics with a small number of partitions; using "earliest" seems to cause offsets not to be committed at all.
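Concretely, the variant that does commit offsets (on topics with few partitions) looks like this; the group_id value here is just an example:

  kafka {
    auto_offset_reset => "latest"
    group_id => "logstash_debug_group"   # example name; the plugin default is "logstash"
    decorate_events => true
    topics_pattern => "${TOPIC_PREFIX}_([a-z]+)_logs"
    bootstrap_servers => "${KAFKA_BOOTSTRAP}"
    consumer_threads => 10
    codec => "json"
  }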

Searching through GitHub issues, I found https://github.com/logstash-plugins/logstash-input-kafka/issues/140. It seems related, but so far I've had no luck getting the pipelines to work reliably.

I realize this may simply be an issue with the Kafka client, but it seems like something else may be at play here.

I was finally able to stabilize Logstash. I had to downgrade to Logstash 5.4 and upgrade the Kafka input plugin separately in order to see the Kafka client logs. The issue came down to the session timing out on the Kafka consumer. To fix it, I had to adjust "max_poll_records" (the newer Kafka client defaults this to 500) and "session_timeout_ms" until things stabilized.
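In config terms the fix amounted to something like this (the exact numbers were found by trial and error, so treat these values as illustrative):

  kafka {
    # ...same settings as above, plus:
    max_poll_records => "100"       # well below the client default of 500
    session_timeout_ms => "60000"   # give each batch more time before the consumer is considered dead
  }

The two settings interact: each batch returned by poll() has to be pushed through the pipeline quickly enough that the group coordinator doesn't time out the session (at least with the client version our plugin shipped). Otherwise the consumer is kicked out, the group rebalances, and the offsets for that batch never get committed, which matches both symptoms above.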
