I have a cluster where the ingestion rate has dropped significantly below what we have seen previously. We are currently processing ~2.5k eps, whereas we have previously seen peaks in the 15k range.

Our cluster runs in Kubernetes with 9 data nodes, 3 ingest nodes, and 5 masters. All data pods are on their own nodes, so there should be no resource contention there. The ingest pods share nodes with the Logstash pods, but performance metrics from those nodes show no obvious source of degradation.

We pull logs from Kafka via Logstash, and past performance shows a single Logstash instance is capable of submitting ~2.5k eps. We currently have 6 Logstash instances running and are still only seeing 2.5k total. Our monitoring solution shows the average response time of _bulk requests to be ~4s; unfortunately I do not have a baseline for this value from when things were healthy. Over lunch today we did see a short spike where we sustained an ingestion rate of >10k eps for about 15 minutes before falling back down. Our EBS volumes are all provisioned at 10k IOPS and everything on the monitoring side there looks fine. We are currently configured for 6 shards per index.
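One thing that stands out to me is that the ~4s _bulk latency may itself be the ceiling. Here is the back-of-envelope arithmetic, assuming Logstash's default pipeline.batch.size of 125 (our actual setting may differ) and ignoring any request pipelining:

```python
def max_eps(batch_size: int, bulk_latency_s: float, workers: int) -> float:
    """Rough upper bound on events/sec when each worker sends one
    synchronous _bulk of `batch_size` events per `bulk_latency_s`."""
    return workers * batch_size / bulk_latency_s

# One pipeline worker at 4s per bulk of 125 events:
print(max_eps(125, 4.0, 1))   # ~31 eps per worker

# Workers needed across all Logstash instances to reach the observed 2.5k eps:
print(2500 * 4.0 / 125)       # ~80 workers
```

If those assumptions hold, adding more Logstash instances would not help much; the bottleneck would be the cluster-side bulk latency, not producer count, which matches what we are seeing.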
I am at a complete loss as to why we are seeing this behaviour, especially given the nondeterministic nature of it.
Kafka currently has a couple billion messages in it (we're quite behind). I tested egress from Kafka via Logstash by just dumping to null and was easily able to pull ~4k eps per Logstash instance, with 4 instances running against it (>16k eps total). So I feel confident that Kafka is not the issue.
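For reference, the drain test was along these lines (a sketch, not our exact config; the bootstrap server and topic names are placeholders, and stdout with the dots codec stands in for the null sink):

```
input {
  kafka {
    bootstrap_servers => "kafka:9092"   # placeholder
    topics            => ["app-logs"]   # placeholder
  }
}
# No filters, and a throwaway output, so the pipeline measures
# pure Kafka consumption rate with no Elasticsearch involved.
output {
  stdout { codec => dots }
}
```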
If anyone has any ideas, it would be greatly appreciated.