One of my "heavy duty" pipelines started acting strangely in the past few weeks. I get alerts that ingestion has stopped, then I see that after a spike everything goes back to normal. I checked the Elasticsearch and Logstash logs on all nodes and there isn't anything relevant. The Redis server log has been empty since the server was started a couple of months ago.
Logstash pipeline event rate:
Redis metrics:
As you can see, when Logstash stops receiving events, Redis memory usage spikes, which means the incoming data flow is consistent. I checked various server-related metrics but I don't see anything that would indicate why this particular pipeline stops. Longer pauses (~2 minutes) happen a few times a day (3-8), but when I checked the event rate I spotted several occasions where it dropped to 0 and then recovered. However, none of the other pipelines (where data flows consistently) shows a similar pattern.
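In case it helps with reproducing this, the Redis side can be watched directly during a stall with something like the sketch below (the list key `logstash` is just a placeholder for whatever key the redis input is configured to read from):

```
# Watch the Redis broker while the pipeline is stalled.
# "logstash" is a placeholder key -- use the list your redis input actually reads from.
while true; do
  ts=$(date '+%H:%M:%S')
  backlog=$(redis-cli llen logstash)                          # pending events in the list
  mem=$(redis-cli info memory | grep '^used_memory_human' | tr -d '\r')
  echo "$ts backlog=$backlog $mem"
  sleep 5
done
```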
Specs: Logstash 7.0.0; Redis server v=5.0.4
Is there any way to debug this and solve the problem? I'm out of ideas on what to check.
If you have multiple pipelines in a single instance and only one of them stops processing, then I would lean towards it being a blocked output and back-pressure shutting down the pipeline. You could test that by enabling a persistent queue on that pipeline and then monitoring the pipeline stats API, which will show how full the queue is. If that rules out a blocked output, it suggests a problem on the input side.
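For reference, a minimal sketch of enabling that (these are the standard `queue.*` settings; the path assumes a package install, and if you run multiple pipelines the same keys go under the relevant entry in pipelines.yml instead):

```
# Enable a persistent queue for the pipeline, capped at 2 GB on disk.
# /etc/logstash/logstash.yml is the package-install default path; with multiple
# pipelines, put the same queue.* keys under the pipeline entry in pipelines.yml.
cat <<'EOF' | sudo tee -a /etc/logstash/logstash.yml
queue.type: persisted
queue.max_bytes: 2gb
EOF

# Restart Logstash so the setting takes effect (systemd assumed here).
sudo systemctl restart logstash
```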
@Badger I've set up a PQ (2 GB) on one of my LS instances and saw that the queue sometimes fills up; the event count and the queue size grew gradually. Since then, I haven't seen any alerts or "zero event" periods. Does that mean that the ES servers couldn't keep up with the temporary burst of events, so they had to "deny" the inputs from that particular LS pipeline?
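This is roughly how I've been checking the queue (`main` is a placeholder for the actual pipeline id, and 9600 is the default Logstash API port):

```
# Pull the stats for one pipeline from the Logstash monitoring API.
# "main" is a placeholder pipeline id; 9600 is the default API port.
# With a PQ enabled, the "queue" section reports the events and bytes currently queued.
curl -s 'http://localhost:9600/_node/stats/pipelines/main?pretty'
```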
In any case, what options do I have if I want to keep events indexed as fast as possible? If I understand the issue correctly, the event rate is more than ES can keep up with, and I'll need to add a new ES node to the cluster, correct?
@badger: yesterday I closely monitored the stack, mostly using the built-in charts in Kibana. Unfortunately, adding a PQ to two of the LS nodes only avoided the zero-indexing periods for that specific index; I still see flatlines in the emitted event rate. Also, I'm sure the ES nodes can handle twice the number of events, since there were times when the PQs sent double the normal amount and it was handled without any issue.
If you check the image I attached, the first row shows another type of log from the same servers, and it does not show a similar pattern...
The instances have plenty of free RAM (dozens of GB), and the same goes for LS, which has 30 GB overall. Same for CPU (32-core servers).
From what I've read in the docs on this topic, ES should respond with a 429 if there are too many requests, right? But the logs (at INFO level) don't show anything during these events.
So I don't really know what's happening there.
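If I understand the cat API correctly, something like this should show bulk rejections on the ES side directly (localhost:9200 stands in for one of the ES nodes):

```
# Check the write thread pool for bulk rejections on the ES nodes.
# localhost:9200 is a placeholder -- point it at one of the ES nodes.
# A non-zero, growing "rejected" column would line up with ES answering with 429s.
curl -s 'http://localhost:9200/_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected'
```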
The thing I really don't get is why other pipelines with 1.5-2K events show a sawtooth-like pattern, while this specific one is mostly a straight line with huge spikes after flatlining.