We're experiencing a 1-2 minute delay between logstash and ES and can't figure out why.
We're relatively sure the problem is between logstash and ES. If I manually trigger an error it appears in rsyslog immediately, and if I have the aggregator (fluent) write to a local log file it is also instant. I suppose the problem could also be from fluent -> logstash, but I'm not quite sure how I would go about testing that.
We thought it might be ES resource constraints, but we doubled the number of ES data nodes to no avail. We also tested manual POST and GET requests against ES (running on ECK) and it seems fine.
We're also missing some logs but I'm not sure how related everything is.
@Christian_Dahlqvist I have seen you weigh in on these threads with super helpful advice so I hope you don't mind the ping. Willing to provide any information that would help fix this
If doubling the number of data nodes does not make a difference I would recommend first looking at your Logstash config. Can you share it?
To get an idea of where the delay arises I would recommend setting a timestamp field at the start of Logstash processing for all events, indicating when Logstash received each one. You can then add or modify an ingest pipeline in Elasticsearch for all indices which adds another timestamp indicating when Elasticsearch received the event. You should then be able to use a script in the pipeline to calculate the delay between the Elasticsearch timestamp and the event creation timestamp, as well as between the Elasticsearch and Logstash timestamps. Store these in two separate fields. This will let you analyse delay over time as long as your servers have synchronized time.
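A minimal sketch of what that could look like (the field names logstash_received and es_received and the pipeline name are just examples, and you would need to adapt the timestamp parsing to your own event format). First, record the receive time as early as possible in Logstash:

```
filter {
  # Run this as the first filter so it records when Logstash picked the
  # event up, before any heavy processing happens.
  ruby {
    code => "event.set('logstash_received', Time.now.utc.strftime('%Y-%m-%dT%H:%M:%S.%LZ'))"
  }
}
```

Then an ingest pipeline that stamps the Elasticsearch receive time and computes the two delays:

```
PUT _ingest/pipeline/delay-tracking
{
  "processors": [
    {
      "set": {
        "field": "es_received",
        "value": "{{_ingest.timestamp}}"
      }
    },
    {
      "script": {
        "source": """
          ZonedDateTime es = ZonedDateTime.parse(ctx.es_received);
          ctx.logstash_to_es_ms = ChronoUnit.MILLIS.between(ZonedDateTime.parse(ctx.logstash_received), es);
          ctx.event_to_es_ms = ChronoUnit.MILLIS.between(ZonedDateTime.parse(ctx['@timestamp']), es);
        """
      }
    }
  ]
}
```

Attach the pipeline via index.default_pipeline on the relevant index templates (or via the pipeline option on the elasticsearch output), and you can then graph logstash_to_es_ms and event_to_es_ms in Kibana to see where the time is spent.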
I would like to see the actual Logstash pipeline(s) config and processing logic.
It would also help to get the full output of the cluster stats API from Elasticsearch and information about how many indices and shards you are actively indexing into.
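For reference, these are the calls I mean (run them in Kibana Dev Tools or via curl); the _cat endpoints give a quick overview of your indices and shards:

```
GET _cluster/stats
GET _cat/indices?v
GET _cat/shards?v
```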
@Christian_Dahlqvist I've provided the output of the cluster stats API from ES here and the output of @Rios' command for pipeline statistics here -- hopefully this is what you need. I redacted all unique IDs; hopefully that isn't too disruptive to the investigation.
In terms of the cluster, we currently have running:
3 master nodes
11 data nodes
2 kibana nodes
4 logstash nodes
1 apm server
Most of these are set to scale with usage; for example, logstash nodes are usually around 2 but scale up with traffic. The data nodes are static, but as I mentioned before it did not make any difference with 5 data nodes or 11.
You need to share your logstash pipeline: what are the inputs, filters and outputs? This would help to see whether there is something in your pipeline configuration that could slow down ingestion.
The pipeline statistics show what types of filters you are using, so I guess we can give some general guidelines even if we do not have the exact pipeline.
If, as discussed earlier, you have verified that the delay is indeed in Logstash, there are a few things to look at.
It looks like you have a single main pipeline with a number of different inputs and codecs followed by a number of mutate and grok filters. I usually recommend looking at grok filters as these can be very slow if written badly. The stats however look fine and I did not spot any filter that seems slow.
Another observation is that you have a large number of elasticsearch output plugins configured. This indicates that you are likely writing to a large number of different index patterns, which can be very inefficient. Logstash works on micro-batches, and if one of these contains records destined for a lot of different index patterns, the batch will result in a lot of calls to Elasticsearch with quite small bulk sizes. This takes time and can be bad for performance.
I would recommend looking into consolidating indices so you actively write to fewer. You may also want to increase the batch size Logstash uses to increase the average bulk request size.
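For the batch size, that is the pipeline.batch.size setting, shown here with example values; the right numbers depend on your event sizes and heap:

```
# logstash.yml, or per pipeline in pipelines.yml
pipeline.batch.size: 1000   # default is 125 events per worker batch
pipeline.batch.delay: 50    # ms to wait for the batch to fill before flushing
```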
One way to improve this would be to split the processing into multiple pipelines, each with a single Elasticsearch output. You can have a main pipeline with the inputs and the logic required to set timestamps and determine which event goes to which pipeline. You then distribute data using pipeline-to-pipeline communication.
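A rough sketch of that layout in pipelines.yml, assuming a hypothetical [service] field is used for routing (the pipeline ids, ports, hosts and conditions are made up for illustration; for longer configs you would point path.config at files instead of using config.string):

```
- pipeline.id: main
  config.string: |
    input { beats { port => 5044 } }
    filter {
      # set timestamps, common parsing, etc.
    }
    output {
      if [service] == "app-a" { pipeline { send_to => ["app_a_indexing"] } }
      else                    { pipeline { send_to => ["catchall_indexing"] } }
    }

- pipeline.id: app_a_indexing
  config.string: |
    input  { pipeline { address => "app_a_indexing" } }
    output { elasticsearch { hosts => ["https://es:9200"] index => "app-a-%{+YYYY.MM.dd}" } }

- pipeline.id: catchall_indexing
  config.string: |
    input  { pipeline { address => "catchall_indexing" } }
    output { elasticsearch { hosts => ["https://es:9200"] index => "catchall-%{+YYYY.MM.dd}" } }
```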
Another way would be to store the index name or pattern in a field in the event metadata. You can then use this field in a single elasticsearch output to generate a single bulk request that targets all the indices in one go.
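A sketch of that approach (again, the conditions, field values and hosts are only examples):

```
filter {
  # Decide the target index once and stash it in @metadata so it is not indexed.
  if [service] == "app-a" {
    mutate { add_field => { "[@metadata][target_index]" => "app-a-%{+YYYY.MM.dd}" } }
  } else {
    mutate { add_field => { "[@metadata][target_index]" => "catchall-%{+YYYY.MM.dd}" } }
  }
}
output {
  elasticsearch {
    hosts => ["https://es:9200"]
    index => "%{[@metadata][target_index]}"
  }
}
```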
We can see the exact delay is represented in the difference between the fluent and logstash timestamps.
I think this means logstash is holding the data. I don't think there is a delay between the time it leaves fluent and the time it arrives in logstash, and I also think this confirms that the delay is not from logstash -> ES, because the logstash processed time is the exact time it appears in Kibana.
Is the Logstash timestamp assigned immediately after the input plugins? Are you setting a timestamp in an ingest pipeline as well so we know when it reached Elasticsearch? Have you verified that the time is synchronised across all hosts?
I just wanted to follow up here so it helps others that come across this thread.
We ended up writing a script on the logstash containers to send a payload directly to logstash and we found it was being indexed immediately. So that kind of eliminated logstash as the problem.
We then found via a tcpdump on the fluentd hosts that once the message appeared in the TCP stream it was arriving in logstash, being processed quickly and indexed... but it was taking 60+ seconds from the time an error was logged for it to appear in the TCP stream.
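In case it helps anyone reproducing this, the tests were along these lines (host names and ports are placeholders, not our real config):

```
# Send a probe event straight to the Logstash input, bypassing fluentd
echo '{"message":"latency probe"}' | nc logstash-host 5000

# On the fluentd host, watch when events actually leave for Logstash
tcpdump -i any -A 'port 5000'
```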
So, unfortunately, while in my initial tests fluentd was indeed immediately forwarding to a local file, it was actually holding events in a buffer before forwarding them to an external host. Specifically, looking at Config: Buffer Section - Fluentd, the default flush_interval is 60s, which explains why it was consistently taking just over 60 seconds.
Adding a buffer section to the fluentd conf and modifying these values fixed the delay.
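For anyone else who lands here, the change was along these lines (the tag, host and port are placeholders, a forward-style output is shown only as an example, and the values should be tuned for your own throughput):

```
<match your.tag.**>
  @type forward
  <server>
    host logstash-host
    port 5000
  </server>
  <buffer>
    flush_mode interval
    flush_interval 1s       # the default of 60s was the source of our delay
    flush_thread_count 2
  </buffer>
</match>
```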
Thank you @Christian_Dahlqvist and @Rios for looking and trying to help. I apologize that my initial tests were misleading and led me to believe that logstash, not fluentd, was the component where the problem needed to be solved.