Delay between Logstash and ES? ELK 7.16.1 [ECK]

Howdy,

Similar to a couple of other threads on this forum, we're experiencing a 1-2 minute delay between Logstash and ES and can't figure out why.

We're relatively sure the problem is between Logstash and ES. If I manually trigger an error it appears in rsyslog immediately, and if I have the aggregator (fluentd) write to a local log file it is also instant. I suppose the problem could also be between fluentd and Logstash, but I'm not quite sure how I would go about testing that.

We thought it might be ES resource constraints, but we doubled the number of ES data nodes to no avail. We also tested manual POSTs and GETs against ES (on ECK) and it seems fine.

We're also missing some logs but I'm not sure how related everything is.

From a logstash pod:

bash-4.2$ curl -XGET 'localhost:9600/_node/stats/os?pretty'
{
  "host" : "logstash-xxx-xxx",
  "version" : "7.16.1",
  "http_address" : "0.0.0.0:9600",
  "id" : "xxx-xxx-xxx-xxx-xxx",
  "name" : "logstash-xxx-xxx",
  "ephemeral_id" : "xxx-xxx-xxx-xxx-xxx",
  "status" : "green",
  "snapshot" : false,
  "pipeline" : {
    "workers" : 64,
    "batch_size" : 125,
    "batch_delay" : 50
  },
  "monitoring" : {
    "hosts" : [ "https://redacted:redacted" ],
    "username" : "redacted"
  },
  "os" : {
    "cgroup" : {
      "cpuacct" : {
        "control_group" : "/",
        "usage_nanos" : 58545196283894
      },
      "cpu" : {
        "control_group" : "/",
        "cfs_quota_micros" : 1600000,
        "stat" : {
          "number_of_times_throttled" : 486,
          "number_of_elapsed_periods" : 174623,
          "time_throttled_nanos" : 493982675334
        },
        "cfs_period_micros" : 100000
      }
    }
  }
}

@Christian_Dahlqvist I have seen you weigh in on these threads with super helpful advice so I hope you don't mind the ping. Willing to provide any information that would help fix this :pray:

If doubling the number of data nodes does not make a difference I would recommend first looking at your Logstash config. Can you share it?

To get an idea of where the delay arises I would recommend setting a timestamp field at the start of Logstash processing for all events, indicating when Logstash received each one. You can then add or modify an ingest pipeline in Elasticsearch for all indices which adds another timestamp indicating when Elasticsearch received the event. A script in that pipeline can then calculate the delay between the ES timestamp and the event creation timestamp, as well as between the ES and Logstash timestamps, and store these in two separate fields. This will let you analyse delay over time as long as your servers have synchronized clocks.
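
For example, roughly along these lines (field and pipeline names are illustrative, not taken from your setup):

# In Logstash, as the very first filter, stamp when Logstash received the event:
filter {
  ruby {
    code => "event.set('logstash_received_at', Time.now.utc.strftime('%Y-%m-%dT%H:%M:%S.%LZ'))"
  }
}

# In Elasticsearch, an ingest pipeline (attached via index.default_pipeline) that
# stamps the arrival time and computes the two delays in milliseconds:
PUT _ingest/pipeline/ingest-lag
{
  "processors": [
    { "set": { "field": "es_received_at", "value": "{{{_ingest.timestamp}}}" } },
    {
      "script": {
        "source": "ZonedDateTime es = ZonedDateTime.parse(ctx.es_received_at); ctx.logstash_to_es_ms = ChronoUnit.MILLIS.between(ZonedDateTime.parse(ctx.logstash_received_at), es); ctx.event_to_es_ms = ChronoUnit.MILLIS.between(ZonedDateTime.parse(ctx['@timestamp']), es);"
      }
    }
  ]
}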

Hey @Christian_Dahlqvist thanks for the response. I'll try timestamping the processing.

Which Logstash configs are you looking for? The curl output in the original post covers pretty much the only settings we're changing via logstash.yml.

I would like to see the actual Logstash pipeline(s) config and processing logic.

It would also help to get the full output of the cluster stats API from Elasticsearch and information about how many indices and shards that you are actively indexing into.

Do you have the pipeline statistics?
http://localhost:9600/_node/stats/pipelines?pretty

@Christian_Dahlqvist I've provided the output of the cluster stats API from ES here, and the output of @Rios' command for pipeline statistics here -- hopefully this is what you need. I redacted all unique IDs; hopefully that isn't too disruptive to the investigation.

In terms of the cluster, we have currently running:

  • 3 master nodes
  • 11 data nodes
  • 2 kibana nodes
  • 4 logstash nodes
  • 1 apm server

Most of these are set to scale with usage; for example, Logstash nodes are usually around 2 but scale up with traffic. The data nodes are static, but as I mentioned before it did not make any difference whether we had 5 data nodes or 11.

You need to share your Logstash pipeline: what are the inputs, filters and outputs? This would help us see whether there is something in your pipeline configuration that could slow down ingestion.

Unfortunately I can't share that. Is it possible to debug without it?

The pipeline statistics show what type of filters you are using, so I guess we can give some general guidelines even if we do not have the exact pipeline.

If, as discussed earlier, you have verified that the delay is indeed in Logstash, there are a few things to look at.

It looks like you have a single main pipeline with a number of different inputs and codecs followed by a number of mutate and grok filters. I usually recommend looking at grok filters as these can be very slow if written badly. The stats however look fine and I did not spot any filter that seems slow.

Another observation is that you have a large number of elasticsearch output plugins configured. This indicates that you are likely writing to a large number of different index patterns, which can be very inefficient. Logstash works on micro-batches, and if one of these contains records destined for many different index patterns, the batch will result in many calls to Elasticsearch with quite small bulk sizes. This takes time and can be bad for performance.

I would recommend looking into consolidating indices so you actively write to fewer. You may also want to increase the batch size Logstash uses to increase the average bulk request size.

One way to improve this would be to split the processing into multiple pipelines, each with a single Elasticsearch output. You can have a main pipeline with the inputs and the logic required to set timestamps and determine which event goes to which pipeline, and then distribute the data using pipeline-to-pipeline communication.
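
A rough sketch of that layout (pipeline ids, paths and the routing condition are made up for illustration):

# pipelines.yml
- pipeline.id: main
  path.config: "/usr/share/logstash/pipeline/main.conf"
- pipeline.id: app-logs
  path.config: "/usr/share/logstash/pipeline/app-logs.conf"
- pipeline.id: misc-logs
  path.config: "/usr/share/logstash/pipeline/misc-logs.conf"

# main.conf -- inputs, timestamping and routing, but no elasticsearch output
output {
  if [log_type] == "app" {
    pipeline { send_to => ["app-logs"] }
  } else {
    pipeline { send_to => ["misc-logs"] }
  }
}

# app-logs.conf -- one elasticsearch output per downstream pipeline
input {
  pipeline { address => "app-logs" }
}
output {
  elasticsearch {
    hosts => ["https://es-host:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
  }
}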

Another way would be to store the index name or pattern in a field in the event metadata. You can then use this field in a single elasticsearch output to generate a single bulk request that targets all the indices in one go.
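
For example (field names and index patterns are illustrative):

filter {
  # decide the target index once and keep it in @metadata so it is not indexed with the event
  if [service] == "nginx" {
    mutate { add_field => { "[@metadata][target_index]" => "nginx-logs" } }
  } else {
    mutate { add_field => { "[@metadata][target_index]" => "app-logs" } }
  }
}

output {
  elasticsearch {
    hosts => ["https://es-host:9200"]
    index => "%{[@metadata][target_index]}-%{+YYYY.MM.dd}"
  }
}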

Hey @Christian_Dahlqvist thanks for the advice. I tried some of the Logstash changes as those seemed like the lowest-hanging fruit, but no dice.

I'm going to look into reducing/consolidating the number of indices, and into the multiple-pipelines approach.

Thanks for the advice, hopefully one of these suggestions helps!

Did you add timestamps to verify where the delay occurs? If so, does delay vary for different types of data?

I am doing this now to try and pinpoint the delay. Interesting point on the data type; I would not suspect so, but will let you know.

Example request with timestamps:

syslog timestamp: Feb 4, 2022 @ 13:26:35.202
fluentd timestamp: 2022-02-04 13:26:35
logstash_processed_at timestamp: 13:27:36.022

We can see the exact delay is represented in the difference between the fluentd and Logstash timestamps.

I think this means Logstash is holding the data. I don't think there is a delay between the event leaving fluentd and arriving in Logstash, and I also think this confirms that the delay is not from Logstash -> ES, because the Logstash processed time is exactly when it appears in Kibana.

Is the Logstash timestamp assigned immediately after the input plugins? Are you setting a timestamp in an ingest pipeline as well so we know when it reached Elasticsearch? Have you verified that the time is synchronised across all hosts?

Have you tried increasing the batch size? Try with pipeline.batch.size: 1000
"workers" : 64 - is this too much? The default is the number of CPUs.

Also, you can test whether these improve the ES output in your case:

  • http_compression - if your log lines are long (say 10K+), this may reduce network delay. I have short log lines, around 1K, and saw no improvement
  • networkaddress.cache.ttl
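
For the batch size and worker settings, e.g. in logstash.yml (or per pipeline in pipelines.yml) - values to experiment with, not a recommendation:

# logstash.yml
pipeline.batch.size: 1000   # bigger micro-batches -> bigger bulk requests to ES
pipeline.workers: 16        # the cgroup stats above show a 16-CPU quota (cfs_quota 1600000 / cfs_period 100000)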

I just wanted to follow up here so it helps others that come across this thread.

We ended up writing a script on the Logstash containers to send a payload directly to Logstash, and we found it was being indexed immediately. That pretty much eliminated Logstash as the problem.

We then found, via a tcpdump on the fluentd hosts, that once a message appeared in the TCP stream it arrived in Logstash and was processed and indexed quickly... but it was taking 60+ seconds from the time an error was logged for it to appear in the TCP stream.

So... unfortunately, while in my initial tests fluentd was indeed immediately processing and forwarding to a local file, output to an external host was actually being held in a buffer before being forwarded. Specifically, looking at Config: Buffer Section - Fluentd, the default flush_interval is 60s, which explains why it was consistently taking just over 60 seconds.

Adding a buffer section to the fluentd conf and modifying these values fixed the delay.
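
Roughly this shape in the fluentd config (tag, host and values are illustrative, not our exact setup):

<match app.**>
  @type forward
  <server>
    host logstash.internal   # wherever fluentd forwards to
    port 5044
  </server>
  <buffer>
    flush_mode interval
    flush_interval 5s        # default is 60s, which matched the ~60s delay we were seeing
    flush_thread_count 2
  </buffer>
</match>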

Thank you @Christian_Dahlqvist and @Rios for looking and trying to help. I apologize that my initial tests were incorrect and led me to believe Logstash was the component where the problem should be solved, rather than fluentd.

helper logstash processing snippet:

filter {
  # copy the original event @timestamp before it is overwritten below
  mutate {
    add_field => { "realtimestamp" => "%{@timestamp}" }
  }

  # record when the event passed through the Logstash filter stage
  ruby {
    code => "event.set('logstash_processed_at', Time.now());"
  }

  mutate {
    convert => { "logstash_processed_at" => "string" }
  }

  # set @timestamp to the Logstash processing time so the delay is visible in Kibana
  date {
    match => [ "logstash_processed_at", "ISO8601" ]
    target => "@timestamp"
    add_tag => "timestamp_changed"
  }
}
