Kafka Input not committing and causing OOM

jalaziz · August 16, 2017, 4:53am

We recently upgraded to ES 5.5.1 and Logstash 5.5.1 from ES 2.4 and Logstash 2.4.

Our pipeline is pretty simple: a kafka input with multiple consumer threads subscribing to a pattern, a few filters, and an ES output.

In the process of upgrading, our pipeline seems to have broken in two ways (possibly related).

After a while, Logstash dies with an OOM error. I tried increasing the heap from the default size to 4GB, but still hit the same issue. We have 4 pipeline workers with a batch size of 2500. No matter what I adjust, we still end up with an OOM. I've analyzed the heap dump and found that most of the memory is consumed by Kafka ConsumerRecord instances. In fact, with a batch size of 250, we see approximately 2 million ConsumerRecords but only 1000 Event objects. We also see at least two ConsumerRecords for every message in Kafka.

Our Kafka configuration is:

  kafka {
    auto_offset_reset => "earliest"
    decorate_events => true
    topics_pattern => "${TOPIC_PREFIX}_([a-z]+)_logs"
    bootstrap_servers => "${KAFKA_BOOTSTRAP}"
    consumer_threads => 10
    codec => "json"
  }

Given the size of the pipeline, the use of the in-memory queue, and Kafka's default max fetch size, the OOM and the number of ConsumerRecord instances don't make much sense. Based on the source code, the "poll()" call should block until the queue frees up some space. It looks as if the Kafka client is leaking ConsumerRecord objects.

With the upgrade of Logstash to 5.5.1, we also ended up upgrading the Kafka client to the latest version. This means offsets are now committed to Kafka rather than to ZK. However, we are not seeing the lag decrease or the offsets increase for most of our consumers. If we change "auto_offset_reset" to "latest" and change the "group_id", then offset tracking does seem to work for topics with a small number of partitions. Using "earliest" seems to cause offsets not to be committed at all.
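
For reference, the variation described above might look roughly like the following. This is only a sketch: the group name is a hypothetical placeholder, and any previously unused "group_id" would serve the same purpose.

  kafka {
    # Changed from "earliest" for the test described above
    auto_offset_reset => "latest"
    # Hypothetical new consumer group name, so no committed offsets exist yet
    group_id => "logstash_offset_test"
    decorate_events => true
    topics_pattern => "${TOPIC_PREFIX}_([a-z]+)_logs"
    bootstrap_servers => "${KAFKA_BOOTSTRAP}"
    consumer_threads => 10
    codec => "json"
  }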

Searching through GitHub issues, I found https://github.com/logstash-plugins/logstash-input-kafka/issues/140. It seems related, but so far I've had no luck getting the pipelines to work reliably.

I realize this may simply be an issue with the Kafka client but it seems like something else may be at play here.

jalaziz · August 17, 2017, 12:31am

I was finally able to stabilize Logstash. I had to downgrade to Logstash 5.4 and upgrade the kafka input plugin separately to see the Kafka logs. The issue came down to the session timing out on the Kafka consumer. To fix it, I had to adjust "max_poll_records" (Kafka defaults this to 500 in newer versions of the client) and "session_timeout_ms" until things stabilized.
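
As a rough sketch, that adjustment could look like the following in the kafka input. The numeric values are illustrative assumptions, not the values used in the pipeline above; suitable numbers depend on how long a batch takes to pass through the filters and the ES output, and finding them takes some trial and error.

  kafka {
    auto_offset_reset => "earliest"
    decorate_events => true
    topics_pattern => "${TOPIC_PREFIX}_([a-z]+)_logs"
    bootstrap_servers => "${KAFKA_BOOTSTRAP}"
    consumer_threads => 10
    codec => "json"
    # Illustrative value (assumption): return fewer records per poll()
    # than the client default of 500
    max_poll_records => "100"
    # Illustrative value (assumption): give the consumer more time
    # before its session is considered dead
    session_timeout_ms => "30000"
  }

Lowering "max_poll_records" shortens the work done between polls, while raising "session_timeout_ms" gives the consumer more slack before the group rebalances; both are aimed at the session timeout described above.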

