We're experiencing a 1-2 minute delay between logstash and ES and can't figure out why.
We're relatively sure the problem is between logstash and ES. If I manually trigger an error it appears in rsyslog immediately, and if I have the aggregator (fluent) write to a local log file it is also instant. I suppose the problem could also be from fluent -> logstash, but I'm not quite sure how I would go about testing that.
We thought it might be ES resource constraints, but we doubled the number of ES data nodes to no avail. We also tested manual POST and GET requests against ES (running on ECK) and it seems fine.
We're also missing some logs but I'm not sure how related everything is.
@Christian_Dahlqvist I have seen you weigh in on these threads with super helpful advice so I hope you don't mind the ping. Willing to provide any information that would help fix this
If doubling the number of data nodes does not make a difference I would recommend first looking at your Logstash config. Can you share it?
To get an idea of where the delay arises I would recommend setting a timestamp field at the start of Logstash processing for all events, indicating when Logstash received each one. You can then add or modify an ingest pipeline in Elasticsearch for all indices which adds another timestamp indicating when Elasticsearch received the event. You should then be able to use a script in the pipeline to calculate the delay between the Elasticsearch timestamp and the event creation timestamp, as well as between the Elasticsearch and Logstash timestamps. Store these in two separate fields. This will let you analyse delay over time as long as your servers have synchronized time.
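A minimal sketch of what that could look like (the field names logstash_received and es_received and the pipeline name are just examples, and you would need to adapt the timestamp parsing to your own event format). First, record the receive time as early as possible in Logstash:

```
filter {
  # Run this as the first filter so it records when Logstash picked the
  # event up, before any heavy processing happens.
  ruby {
    code => "event.set('logstash_received', Time.now.utc.strftime('%Y-%m-%dT%H:%M:%S.%LZ'))"
  }
}
```

Then an ingest pipeline that stamps the Elasticsearch receive time and computes the two delays:

```
PUT _ingest/pipeline/delay-tracking
{
  "processors": [
    {
      "set": {
        "field": "es_received",
        "value": "{{_ingest.timestamp}}"
      }
    },
    {
      "script": {
        "source": """
          ZonedDateTime es = ZonedDateTime.parse(ctx.es_received);
          ctx.logstash_to_es_ms = ChronoUnit.MILLIS.between(ZonedDateTime.parse(ctx.logstash_received), es);
          ctx.event_to_es_ms = ChronoUnit.MILLIS.between(ZonedDateTime.parse(ctx['@timestamp']), es);
        """
      }
    }
  ]
}
```

Attach the pipeline via index.default_pipeline on the relevant index templates (or via the pipeline option on the elasticsearch output), and you can then graph logstash_to_es_ms and event_to_es_ms in Kibana to see where the time is spent.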
I would like to see the actual Logstash pipeline(s) config and processing logic.
It would also help to get the full output of the cluster stats API from Elasticsearch and information about how many indices and shards you are actively indexing into.
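For reference, these are the calls I mean (run them in Kibana Dev Tools or via curl); the _cat endpoints give a quick overview of your indices and shards:

```
GET _cluster/stats
GET _cat/indices?v
GET _cat/shards?v
```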
@Christian_Dahlqvist I've provided the output of the cluster stats API from ES here and the output of @Rios' command for pipeline statistics here -- hopefully this is what you need. I redacted all unique IDs; hopefully that isn't too disruptive to the investigation.
In terms of the cluster, we currently have running:
3 master nodes
11 data nodes
2 kibana nodes
4 logstash nodes
1 apm server
Most of these are set to scale with usage; for example, logstash nodes are usually around 2 but scale up with traffic. The data nodes are static, but as I mentioned before it did not make any difference with 5 data nodes or 11.
You need to share your logstash pipeline: what are the inputs, filters and outputs? This would help to see whether there is something in your pipeline configuration that could slow down ingestion.
The pipeline statistics show what types of filters you are using, so I guess we can give some general guidelines even if we do not have the exact pipeline.
If, as discussed earlier, you have verified that the delay is indeed in Logstash, there are a few things to look at.
It looks like you have a single main pipeline with a number of different inputs and codecs followed by a number of mutate and grok filters. I usually recommend looking at grok filters as these can be very slow if written badly. The stats however look fine and I did not spot any filter that seems slow.
Another observation is that you have a large number of elasticsearch output plugins configured. This indicates that you are likely writing to a large number of different index patterns, which can be very inefficient. Logstash works on micro-batches, and if one of these contains records destined for a lot of different index patterns, the batch will result in a lot of calls to Elasticsearch with quite small bulk sizes. This takes time and can be bad for performance.
I would recommend looking into consolidating indices so you actively write to fewer. You may also want to increase the batch size Logstash uses to increase the average bulk request size.
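For the batch size, that is the pipeline.batch.size setting, shown here with example values; the right numbers depend on your event sizes and heap:

```
# logstash.yml, or per pipeline in pipelines.yml
pipeline.batch.size: 1000   # default is 125 events per worker batch
pipeline.batch.delay: 50    # ms to wait for the batch to fill before flushing
```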
One way to improve this would be to split the processing into multiple pipelines, each with a single Elasticsearch output. You can have a main pipeline with the inputs and the logic required to set timestamps and determine which event goes to which pipeline. You then distribute data using pipeline-to-pipeline communication.
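A rough sketch of that layout in pipelines.yml, assuming a hypothetical [service] field is used for routing (the pipeline ids, ports, hosts and conditions are made up for illustration; for longer configs you would point path.config at files instead of using config.string):

```
- pipeline.id: main
  config.string: |
    input { beats { port => 5044 } }
    filter {
      # set timestamps, common parsing, etc.
    }
    output {
      if [service] == "app-a" { pipeline { send_to => ["app_a_indexing"] } }
      else                    { pipeline { send_to => ["catchall_indexing"] } }
    }

- pipeline.id: app_a_indexing
  config.string: |
    input  { pipeline { address => "app_a_indexing" } }
    output { elasticsearch { hosts => ["https://es:9200"] index => "app-a-%{+YYYY.MM.dd}" } }

- pipeline.id: catchall_indexing
  config.string: |
    input  { pipeline { address => "catchall_indexing" } }
    output { elasticsearch { hosts => ["https://es:9200"] index => "catchall-%{+YYYY.MM.dd}" } }
```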
Another way would be to store the index name or pattern in a field in the event metadata. You can then use this field in a single elasticsearch output to generate a single bulk request that targets all the indices in one go.
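A sketch of that approach (again, the conditions, field values and hosts are only examples):

```
filter {
  # Decide the target index once and stash it in @metadata so it is not indexed.
  if [service] == "app-a" {
    mutate { add_field => { "[@metadata][target_index]" => "app-a-%{+YYYY.MM.dd}" } }
  } else {
    mutate { add_field => { "[@metadata][target_index]" => "catchall-%{+YYYY.MM.dd}" } }
  }
}
output {
  elasticsearch {
    hosts => ["https://es:9200"]
    index => "%{[@metadata][target_index]}"
  }
}
```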
We can see the exact delay is represented in the difference between the fluent and logstash timestamps.
I think this means logstash is holding the data. I don't think there is a delay between the time it leaves fluent and the time it arrives in logstash, and I also think this confirms that the delay is not from logstash -> ES, because the logstash processed time is the exact time it appears in Kibana.
Is the Logstash timestamp assigned immediately after the input plugins? Are you setting a timestamp in an ingest pipeline as well so we know when it reached Elasticsearch? Have you verified that the time is synchronised across all hosts?
I just wanted to follow up here so it helps others that come across this thread.
We ended up writing a script on the logstash containers to send a payload directly to logstash and we found it was being indexed immediately. So that kind of eliminated logstash as the problem.
We then found via a tcpdump on the fluentd hosts that once the message appeared in the TCP stream it was arriving in logstash, being processed quickly and indexed... but it was taking 60+ seconds from the time an error was logged for it to appear in the TCP stream.
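In case it helps anyone reproducing this, the tests were along these lines (host names and ports are placeholders, not our real config):

```
# Send a probe event straight to the Logstash input, bypassing fluentd
echo '{"message":"latency probe"}' | nc logstash-host 5000

# On the fluentd host, watch when events actually leave for Logstash
tcpdump -i any -A 'port 5000'
```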
So, unfortunately, while in my initial tests fluentd was indeed immediately forwarding to a local file, it was actually holding events in a buffer before forwarding them to an external host. Specifically, looking at Config: Buffer Section - Fluentd, the default flush_interval is 60s, which explains why it was consistently taking just over 60 seconds.
Adding a buffer section to the fluentd conf and modifying these values fixed the delay.
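For anyone else who lands here, the change was along these lines (the tag, host and port are placeholders, a forward-style output is shown only as an example, and the values should be tuned for your own throughput):

```
<match your.tag.**>
  @type forward
  <server>
    host logstash-host
    port 5000
  </server>
  <buffer>
    flush_mode interval
    flush_interval 1s       # the default of 60s was the source of our delay
    flush_thread_count 2
  </buffer>
</match>
```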
Thank you @Christian_Dahlqvist and @Rios for looking and trying to help. I apologize that my initial tests were misleading and led me to believe that logstash, not fluentd, was the component where the problem needed to be solved.