Time delay between logstash and elasticsearch

My current pipeline is:
filebeat->Logstash->ES(3 nodes).

Filebeat and Logstash are deployed in the kubernetes cluster, both of the them are version-7.6.2.
ES is deployed as a container on a virtual machine, images version is [amazon/opendistro-for-elasticsearch:1.6.0]

I found that the logs on ES have a delay of about 7~8 minutes.

Logstash config:

 inputs:
    main: |-
      input {
        beats {
          port => 5044
        }
      }

  filters:
    # main: |-
    #   filter {
    #   }

  outputs:
    main: |-
      output {
        elasticsearch {
          hosts => ["${ELASTICSEARCH_HOST}:${ELASTICSEARCH_PORT}"]
          manage_template => false
          index => "XXX_%{+YYYY.MM.dd}"
          user => "XXXX"
          password => "XXX"
          ilm_enabled => false
        }
      }

more infomation about logstash:

$ cat logstash.yml 
config.reload.automatic: true
http.host: 0.0.0.0
http.port: 9600
path.config: /usr/share/logstash/pipeline
path.data: /usr/share/logstash/data
queue.checkpoint.writes: 1
queue.drain: true
queue.max_bytes: 1gb
queue.type: persisted

$ curl -XGET 'localhost:9600/_node/stats/os?pretty'
{
  "host" : "log-logstash-0",
  "version" : "7.6.2",
  "http_address" : "0.0.0.0:9600",
  "id" : "9a699ca0-63ff-4d4c-bfda-6c1672603009",
  "name" : "log-logstash-0",
  "ephemeral_id" : "cacf7e7a-0a5e-426b-93d1-b50d72037c07",
  "status" : "green",
  "snapshot" : false,
  "pipeline" : {
    "workers" : 2,
    "batch_size" : 125,
    "batch_delay" : 50
  },
  "os" : {
    "cgroup" : {
      "cpu" : {
        "cfs_quota_micros" : 200000,
        "stat" : {
          "time_throttled_nanos" : 10805873439,
          "number_of_times_throttled" : 198,
          "number_of_elapsed_periods" : 699951
        },
        "control_group" : "/",
        "cfs_period_micros" : 100000
      },
      "cpuacct" : {
        "usage_nanos" : 3358394995977,
        "control_group" : "/"
      }
    }
  }
}

According to prometheus indicators, logstash has a low load(cpu: 0.035, memory: 1GB).
I also made a pressure test on ES, the throughput is about 6000/s.

How can I figure out where the problem currently resides?

Are you setting @timestamp to a value from the log or the ingest time? If ingest time, compare @timestamp to the value in the log.

What is the index refresh_interval?

If you have Kibana monitoring enabled, what is the indexing latency? Monitoring for logstash, if enabled, also has event latency.

Are you using ntp everywhere?

1 Like

It seems you have set this to ensure as high durability as possible. Note that this also will result in increased disk I/O and potentially lower throughput as a result. Does the storage you are using for the Logstash node have capacity to handle this or might it be a bottleneck?

1 Like

@rugenl Thanks for your reply

I'm setting @timestamp to a value from the ingest time. The ingest time is about 3 seconds slower than the log time

The index refresh_interval is not specified, Use the default value(1s).

GET <index>/_settings?pretty
{
  "XXX" : {
    "settings" : {
      "index" : {
        "creation_date" : "1586822809255",
        "number_of_shards" : "1",
        "number_of_replicas" : "1",
        "uuid" : "t3dhllx5SfG2XzKakS-RSw",
        "version" : {
          "created" : "7060199"
        },
        "provided_name" : "XXX"
      }
    }
  }
}

We did not enable monitoring.

Yes.

@Christian_Dahlqvist Thanks for your reply.
The value of iowait is 6.5% ~ 7.5 on Logstash node. Disk write speed is 80KB/s and read speed is 20KB/S, so I think the disk is not the bottleneck.

BTW: A few days after Logstash restarts, the problem will reproduce. The problem will not reproduce immediately.
I have 4 kubernetes clusters, I restarted logstash on one of the clusters two days ago, but the problem has not been reproduced.

That much iowait could be noticeable. Are you monitoring iowait so you can see whether the amount of iowait varies over time? Does this behaviour change if you increase the write checkpoint size?

@Christian_Dahlqvist Thanks for your advise.I changed the value of queue.checkpoint.writes from 1 to 1024. The value of iowait is significantly lower than before.
But the logs on ES still have a delay of about 7~8 minutes.

Why is it reporting being throttled? How much resurces does it have available? Is the time synchronized across the hosts/pods?

Host resource: 4Core / 16G MEM
Pod resource: 2Core / 2G MEM

Time is synchronized across the hosts/pods.

$ date; sudo docker exec 0723956d651b date
Tue Apr 14 16:43:04 CST 2020
Tue Apr 14 08:43:04 UTC 2020

Are all pods running on this host, including the application generating logs and Elasticsearch?

If so, what does CPU usage on the underlying host look like?

No, filebeat, logstash and application pods are on the same host.
The Elasticsearch cluster is outside the k8s cluster.

The architecture is as follows:
elk

I have tuned logstash performance with follow configuration:

$ cat logstash.yml 
......
pipeline.batch.delay: 100
pipeline.batch.size: 250
pipeline.workers: 4
queue.checkpoint.writes: 4096
......

Now it looks like there is no delay, the amount of iowait has decreased.
Thanks for your help.
@Christian_Dahlqvist
@rugenl

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.