Logstash bulk requests growing after some time

Hi,

I have a strange situation with Logstash and Elasticsearch, and maybe someone can help me out.
My Logstash bulk requests to Elasticsearch keep growing after some hours (2-24h).

There are around 1000 Winlogbeat agents connected to Logstash via the beats input.

I use 18 Logstash nodes in Kubernetes with 8 GB JVM heap per node, 8 workers, and a batch size of 1250 with a delay of 50.
After restarting all Logstash nodes, my pipeline push duration to Elasticsearch is around 10-30 ms at the beginning.
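
One way to narrow this down is to sample the Logstash node stats API directly and compare deltas between samples instead of looking at the raw totals. Below is a minimal sketch in Python (standard library only); the API host/port, the pipeline id "main", and the exact field layout are assumptions based on the 7.x stats API, so adjust them to your setup.

    import json
    import time
    import urllib.request

    STATS_URL = "http://localhost:9600/_node/stats/pipelines"   # assumed Logstash API host/port

    def sample():
        with urllib.request.urlopen(STATS_URL) as resp:
            pipeline = json.load(resp)["pipelines"]["main"]      # "main" is the default pipeline id
        beats = next(p for p in pipeline["plugins"]["inputs"] if p["name"] == "beats")
        es = next(p for p in pipeline["plugins"]["outputs"] if p["name"] == "elasticsearch")
        return {
            "push_ms": beats["events"]["queue_push_duration_in_millis"],
            "output_ms": es["events"]["duration_in_millis"],
            "events_out": es["events"]["out"],
            "bulk_ok": es["bulk_requests"]["successes"],
        }

    # All of these stats are cumulative since the last Logstash restart,
    # so only the deltas between two samples are meaningful.
    prev = sample()
    while True:
        time.sleep(60)
        cur = sample()
        print({key: cur[key] - prev[key] for key in cur})
        prev = cur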

After some time the bulk requests in Logstash grow, and other metrics grow with them.

But there are no issues or errors in the Logstash and Elasticsearch logs during this time.

The pipeline push duration and the JVM garbage collection in Logstash keep growing.

But in Elasticsearch there is no difference at the same time.
I mean, there are no additional events written to the index, no growing garbage collection, or anything like that.

Here are some Elasticsearch metrics.

Elasticsearch indexing:

To me it looks like an internal Logstash queue or event loop that is slowly growing, or maybe the agents are sending more events after some time?
All the Logstash metrics grow slowly at the same time (events filtered, events in, events out, and so on).
But then I should also see higher doc counts in the index. :confused:

I cannot understand why these Logstash metrics towards Elasticsearch are growing while the index write metrics are not.
There is also no bigger load of indexed doc counts in the winlogbeat index over the same time period.
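
On the Elasticsearch side, note that the output in the config below sets document_id to the fingerprint, so re-delivered events overwrite existing documents and the doc count of the index does not have to grow even when more indexing happens; the primaries' index_total from the index stats is the better throughput signal. A minimal sketch for sampling it in Python (host, credentials, and index pattern are placeholders matching the config below):

    import base64
    import json
    import ssl
    import time
    import urllib.request

    ES_URL = "https://xxx:9200/winlogbeat-*/_stats/indexing"   # placeholder host and index pattern
    AUTH = "Basic " + base64.b64encode(b"xxx:xxx").decode()    # placeholder user:password

    # Skip certificate verification, mirroring ssl_certificate_verification => false
    CTX = ssl.create_default_context()
    CTX.check_hostname = False
    CTX.verify_mode = ssl.CERT_NONE

    def index_total():
        req = urllib.request.Request(ES_URL, headers={"Authorization": AUTH})
        with urllib.request.urlopen(req, context=CTX) as resp:
            return json.load(resp)["_all"]["primaries"]["indexing"]["index_total"]

    prev = index_total()
    while True:
        time.sleep(60)
        cur = index_total()
        print(f"indexing operations in the last 60s: {cur - prev}")
        prev = cur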

Here is my config (Logstash version 7.17.8):

    input {
      beats {
        port => "5044"
        ssl => true
        ssl_certificate_authorities => "xx"
        ssl_certificate => "xx"
        ssl_key => "xx"
        ssl_verify_mode => "force_peer"
        ecs_compatibility => "v1"
        tags => ["beats"]
      }
    }
    filter {
        if "beats" in [tags] and "prod" not in [fields][env] and "xxx" not in [fields][token] {
          drop {}
        }
        # create fingerprint of message avoiding duplicate events
        fingerprint {
          source => "message"
          target => "[@metadata][fingerprint]"
          method => "SHA1"
          key => "xxx"
          base64encode => true
        }
        # drop broadcasts from filtering platform
        if "beats" in [tags] and "Filtering Platform Connection" in [event][action] {
          mutate {
            add_field => { "WindowsFilteringPlatform" => "true"}
          }
        }
        if "true" in [WindowsFilteringPlatform] and [message] =~ /someips/ {
          drop {}
        }
        # cancel older events
        if "beats" in [tags] {
          ruby {
            init => "require 'time'"
            code => 'if LogStash::Timestamp.new(event.get("@timestamp")+86400) < ( LogStash::Timestamp.now)
              event.cancel
            end'
          }
        }
    }
    output {
      if [@metadata][beat] == "winlogbeat" {
        elasticsearch {
          hosts => ["https://xxx:9200"]
          ilm_enabled => true
          ilm_rollover_alias => "winlogbeat"
          ilm_pattern => "{now/d}-000001"
          ilm_policy => "winlogbeat"
          ssl_certificate_verification => false
          user => "xxx"
          password => "xxx"
          ecs_compatibility => "v1"
          document_id => "%{[@metadata][fingerprint]}"
        }
      }
    }

  logstash.yml: |
    http.host: "0.0.0.0"
    pipeline.ecs_compatibility: "disabled"
    pipeline.batch.size: "1250"
    pipeline.batch.delay: "50"
    pipeline.workers: "8"

The Elasticsearch cluster and the Winlogbeat agents are running the same version as well (7.17.8).

The Elasticsearch cluster runs with 6 master nodes (2 GB JVM heap per node), 3 coordinating nodes (2 GB), 9 hot nodes (8 GB), and 6 warm nodes (8 GB); 2 cold nodes will be added soon.

The index template with the ILM policy starts in the hot phase with 9 shards and 1 replica, and in the warm phase (after 24h) the index is shrunk down to 6 shards, and so on.
The index refresh interval is set to 60 seconds.

I have already tried different batch sizes and worker counts in Logstash, more nodes and fewer nodes, more memory and less memory... always the same situation.
I also see this problem on smaller setups, for example with just 20 Filebeat agents and only 20 GB per day.

Maybe it is a bug, or maybe I am missing something, I don't know.
Can someone tell me what is going on here?

Many thanks!!!!

I found out that this issue is monitoring-related and not an issue in Logstash.
I now have to check why the Logstash metric counts are sometimes cut off and the total counter does not show the right value.
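
For anyone hitting the same thing: the Logstash pipeline counters (events in/out, bulk requests, durations) are cumulative since the last restart, so a monitoring view has to work with deltas between samples and handle counter resets; otherwise a restart or a dropped sample can look like growth. A minimal sketch of that delta logic (the numbers are made up):

    # Derive a rate from a cumulative counter such as Logstash's events.out.
    def rate(prev_count, cur_count, interval_s):
        if cur_count < prev_count:      # counter was reset, e.g. by a Logstash restart
            delta = cur_count           # count only what accumulated since the reset
        else:
            delta = cur_count - prev_count
        return delta / interval_s

    # Samples taken 60 seconds apart:
    print(rate(1_200_000, 1_275_000, 60))   # 1250.0 events/s
    print(rate(1_275_000, 30_000, 60))      # restart in between: 500.0 events/s, not a negative value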

This topic can be closed.
