Well, it is not easy to find the cause of a bottleneck, so until there is some hint of what the cause might be, I would say that nothing is totally irrelevant.
It is pretty common to think that Logstash is the cause of the bottleneck, only to find in the end that it is Elasticsearch for some particular reason. That's why it is important to know the specs of each node, the number of nodes, and the number of indices/shards, as these can impact the overall performance of your cluster, both for indexing and searching.
For example, you said that your nodes have 32 vCPUs, 120 GB of memory and multiple SSDs.
If you are giving 70% of your memory to the Elasticsearch heap on a 120 GB RAM node, that is close to 84 GB of heap, which can be too much. The recommendation is to keep the heap below ~30 GB, because above that threshold the JVM loses compressed object pointers (compressed oops) and performance suffers.
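As a sketch, assuming you set the heap in `jvm.options` (the values here are illustrative, not a recommendation for your exact workload), pinning it below the compressed oops threshold would look like this:

```
# config/jvm.options -- illustrative values
# Keep Xms and Xmx equal so the heap is allocated at startup and never resized
-Xms30g
-Xmx30g
```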
Also, you said that the VMs have multiple SSDs. How are you configuring `path.data` in your `elasticsearch.yml`? Are you using multiple data paths? Are you using some kind of RAID? Can you share your `elasticsearch.yml`?
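For reference, a multiple-data-path setup in `elasticsearch.yml` would look something like this (the mount points are just placeholders, assuming one per SSD):

```
# elasticsearch.yml -- hypothetical mount points, one per SSD
path.data:
  - /mnt/ssd1/elasticsearch
  - /mnt/ssd2/elasticsearch
  - /mnt/ssd3/elasticsearch
```

With RAID 0 you would instead present a single striped volume to the OS and keep a single `path.data` entry.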
Knowing the number of nodes and their specs is something that helps.
What is the size of your index? There is also a recommendation to keep shards at a few tens of GB each, something close to 40 GB or 50 GB, so with 30 shards your index would be something like 1.5 TB? Depending on the number of nodes, you can end up with an oversharded cluster, which can also impact the overall performance.
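One quick way to check this is the `_cat/shards` API, which lists the size of every shard (the index pattern below is a placeholder):

```
curl -s 'localhost:9200/_cat/shards/my-index-*?v&h=index,shard,prirep,store&s=store:desc'
```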
From your Logstash pipeline I didn't see anything that could be slowing it down. I avoid ruby code in my own pipelines because it can sometimes slow things down, but I don't think that is the case here.
But with 16 vCPUs, you could try giving more workers to that pipeline to see if something changes; this would be the first test.
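A minimal sketch, assuming you run this pipeline from `pipelines.yml` (the pipeline id and config path are placeholders), would be to raise `pipeline.workers`, which defaults to the number of CPU cores:

```
# pipelines.yml -- hypothetical pipeline id and path
- pipeline.id: kafka-to-es
  path.config: "/etc/logstash/conf.d/kafka-to-es.conf"
  pipeline.workers: 16     # try values at or above the vCPU count
  pipeline.batch.size: 250 # larger batches can also help; tune and measure
```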
How many Logstash nodes are consuming from that Kafka topic, and how many partitions does the topic have? You could also try to adjust `consumer_threads` in the kafka input to see if that helps.
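As a rough sketch (the broker address, topic, and group id are placeholders), the usual guidance is to match the total `consumer_threads` across all Logstash nodes to the topic's partition count, since threads beyond the number of partitions just sit idle:

```
# kafka input -- hypothetical values; total consumer_threads across all
# Logstash nodes should not exceed the topic's partition count
input {
  kafka {
    bootstrap_servers => "kafka01:9092"
    topics            => ["my-topic"]
    group_id          => "logstash-my-topic"
    consumer_threads  => 4
  }
}
```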
Do you have a separate monitoring cluster to get the metrics from your production cluster, or are you using some other tool to monitor the index rate?
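If you don't have a monitoring cluster, one rough way to eyeball the index rate is to sample the indexing counter twice and divide by the interval (the index name is a placeholder):

```
# Sample index_total twice, 60s apart; the difference is docs indexed per minute
curl -s 'localhost:9200/my-index/_stats/indexing?filter_path=_all.primaries.indexing.index_total'
sleep 60
curl -s 'localhost:9200/my-index/_stats/indexing?filter_path=_all.primaries.indexing.index_total'
```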