I have a question about a side effect I am observing with my Logstash kafka-input configuration. It is not apparent to me what the problem is, and I hope people with deep expertise can help me out here.
Our event rate is typically about 10K EPS. It occasionally fluctuates between 10K and 20K and, in the worst case, goes above 30K.
My node setup is as below:
- 9 Nodes of Kafka
- 18 Nodes of Logstash
- 18 Data Nodes of Elasticsearch, 3 Master Nodes, 9 Ingest Nodes
There are several topics in the Kafka cluster, about 20-25.
Each Logstash node has the following Kafka input parameters:
Besides that, the following three parameters are set in the pipeline configuration:
There is a single pipeline processing all events on each LS node.
Each Kafka topic has the following configuration:
Each topic has 540 partitions and each LS node's Kafka input has 30 consumer threads, so 30 x 18 = 540 threads, one for each partition. When not under heavy pressure, I see that it can reach the expected EPS even with a single consumer thread per node (i.e. 18 in total), but we leave it at 540 because some topics get heavy ingest depending on traffic conditions.
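For reference, the thread-to-partition arithmetic above can be sketched as follows. This is only an illustration for a single topic and assumes partitions are spread evenly across all consumer threads in the group; the real assignment depends on the partition assignment strategy in use. The function name `thread_coverage` is hypothetical.

```python
def thread_coverage(partitions, nodes, threads_per_node):
    """Summarize how one topic's partitions spread over consumer threads.

    Assumes an even spread across the consumer group (an assumption,
    not a statement about the configured assignment strategy).
    """
    total_threads = nodes * threads_per_node
    base, extra = divmod(partitions, total_threads)
    return {
        "total_threads": total_threads,
        "min_partitions_per_thread": base,
        "max_partitions_per_thread": base + (1 if extra else 0),
        # Threads beyond the partition count have nothing to read.
        "idle_threads": max(total_threads - partitions, 0),
    }


# Setup from the post: 540 partitions, 18 LS nodes, 30 threads each.
print(thread_coverage(540, 18, 30))
# Alternative mentioned in the post: one thread per node (18 in total).
print(thread_coverage(540, 18, 1))
```

With 30 threads per node each thread owns one partition of the topic; with one thread per node each thread owns 30.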
Now to the issue being faced. As noted, there are several Kafka topics. Two of them have a very heavy ingest flow: about 4-5K EPS for the first and 2-6K EPS for the second.
When Topic 1 is under heavy load (defined as a lag of 4-5M messages) and there is no issue with ES ingestion, Topic 2's EPS swings between almost no ingest and the expected ingest value, even though the Kafka lag for Topic 2 is "normal" (several thousand records).
Under normal conditions on Topic 1 (lag of several thousand records), Topic 2's EPS looks like below; it is almost always above the threshold of 1000 EPS:
Topic 1 under heavy lag (2-5M records) is still able to ingest well and looks like below:
Now, when we have an ES outage (possibly due to a heavy spike in traffic), this is what happens to the Topic 1 pattern:
The hand-drawn blue line is how the traffic should look, since Kafka is buffering it. "retention.ms" is 3 days, and the lag returns to "normal" numbers fairly quickly (within a few hours, or at most a day).
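To make the reasoning about buffering explicit, here is a rough sketch of the catch-up arithmetic: the backlog drains at the difference between the consume rate and the incoming rate, and nothing should be lost as long as that takes less than the retention period. The 5M lag, 5K EPS inflow and 3-day retention come from the numbers above; the 5.5K EPS consume rate is purely an assumed value for illustration, and `drain_seconds` is a hypothetical helper.

```python
def drain_seconds(lag_records, produce_eps, consume_eps):
    """Seconds until the backlog is consumed, or None if it never drains."""
    surplus = consume_eps - produce_eps
    if surplus <= 0:
        return None  # consumers are not faster than producers
    return lag_records / surplus


RETENTION_SECONDS = 3 * 24 * 3600  # retention.ms of 3 days

# 5M lag and 5K EPS inflow are from the post; 5.5K EPS consume rate is assumed.
t = drain_seconds(5_000_000, 5_000, 5_500)
print(f"drain time: {t / 3600:.1f} h, within retention: {t < RETENTION_SECONDS}")
```

Under these assumed rates the backlog clears in under three hours, far inside the 3-day retention window, which is why I would expect the records to show up late rather than be missing.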
The missing data/records never make it to Elasticsearch, but I get those intermediate spikes with tails, as if LS starts reading from a new position roughly every hour.
Topic 2 suffers similarly during this, as shown below:
Apologies for the ugly hand-drawn blue lines in the image above.
None of the other topics suffer like this; their data/records are recovered fully, with the same patterns in EPS trends as on "normal" days.
Could those of you who are running large clusters with Kafka as a front buffering component suggest what I may be doing incorrectly here, and how to debug this to find which component or configuration is causing it?
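One check that might help tell the affected topics apart from the healthy ones is to sample the consumer group's committed offset per partition over time (for example from periodic `kafka-consumer-groups --describe` output) and flag intervals where it advances faster than the consumers could plausibly have read. The sketch below runs on made-up sample data; `Sample`, `find_jumps` and the `max_eps` ceiling are hypothetical names and values, not something taken from my setup.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    ts: float       # sample time in seconds
    committed: int  # committed offset of the group for one partition


def find_jumps(samples, max_eps):
    """Return (ts, skipped) for intervals where the committed offset moved
    further than max_eps * elapsed, i.e. faster than could be consumed."""
    jumps = []
    for prev, cur in zip(samples, samples[1:]):
        elapsed = cur.ts - prev.ts
        advanced = cur.committed - prev.committed
        budget = max_eps * elapsed
        if advanced > budget:
            jumps.append((cur.ts, int(advanced - budget)))
    return jumps


# Made-up samples one minute apart for a single partition.
samples = [
    Sample(0, 1_000),
    Sample(60, 4_000),
    Sample(120, 900_000),  # suspicious: far more than 60 s of reading
    Sample(180, 903_000),
]
print(find_jumps(samples, max_eps=100))
```

If jumps like this line up with the hourly spikes on the two heavy topics but never appear on the other topics, that would suggest the consumer position is being moved rather than the records being dropped further downstream.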
We are monitoring JMX metrics from Kafka, and it is stable during these scenarios. There is plenty of storage for ES and heap is mostly at 50% full; only one or two ES nodes go out of service due to a sudden traffic spike. Even if I restart the ES cluster, while recovering all primary indices the cluster is able to reach ingest rates of 20-25K EPS and almost always sustains more than 10K. After the replicas have been recovered, ingestion rates remain at a sustained 10-15K, as normal.
ELK is on version 7.17.9 and we plan to move to 7.17.14 soon. Kafka is at 6.0.1.
Any guidance would be much appreciated.