Logstash sometimes sends duplicate entries to ES

Hi there,

I have Metricbeat configured to send events to a single Logstash host, which in turn sends them to an ES cluster.

Metricbeat config:

metricbeat.modules:
#------------------------------- System Module -------------------------------
- module: system
  metricsets:
    - cpu
    - load
    - diskio
    - filesystem
    - fsstat
    - memory
    - network
    - process

  enabled: true
  period: 60s
  processes: ['.*']

#------------------------------- Apache Module -------------------------------
- module: apache
  metricsets: ["status"]
  enabled: true
  period: 60s

  hosts: ["http://127.0.0.1"]
  server_status_path: "server-status"

#================================ General =====================================

name: myserver-dev
tags: ["test"]

#================================ Outputs =====================================

output.logstash:
  hosts: ["logstash1.private:5044"]
  loadbalance: true

Logstash pipeline:

input {
  beats {
    port => 5044
  }
}

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}"}
  }
  geoip {
    source => "clientip"
  }
}

output {
  elasticsearch {
    hosts => ["es1.private:9200","es2.private:9200","es3.private:9200"]
    index => "%{[@metadata][beat]}-%{+YYYY.MM.dd}"
    document_type => "%{[@metadata][type]}"
  }
}

It all works fine, except that occasionally Logstash will send a duplicate event to ES. The data in the duplicated document is identical except for the _id. I think this may have something to do with the ES load balancing.

Does anyone have any suggestions for how I can diagnose what's going on?

If Beats or Logstash encounters any problem shipping data downstream, it will retry automatically. This means duplicates cannot be entirely avoided in the pipeline. However, if you define a document ID based on the content of the event, any attempt to write the same event twice results in an update rather than a new duplicate document.
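
For example, a minimal sketch of that idea against the output above, assuming an earlier filter has stored a content-based hash in [@metadata][fingerprint] (a field name chosen here purely for illustration):

output {
  elasticsearch {
    hosts => ["es1.private:9200","es2.private:9200","es3.private:9200"]
    index => "%{[@metadata][beat]}-%{+YYYY.MM.dd}"
    document_type => "%{[@metadata][type]}"
    # Indexing a second copy with the same _id overwrites the first
    # document instead of creating a duplicate.
    document_id => "%{[@metadata][fingerprint]}"
  }
}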

Thanks for the info!

Is there a recommended way of implementing this with Metricbeat? I'm not sure how I would uniquely identify the events without somehow encapsulating the entire beat.

You can calculate the ID in Logstash.

@fozboz
This was linked above, but here is the explicit link on handling duplicates in Logstash:

I'm going with the "UUID" method to generate the ID in Logstash, which solves the problem of duplicates when shipping from Logstash to ES.
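
For reference, a minimal sketch of that method, using the uuid filter and storing the ID in @metadata so it isn't indexed as part of the document (the field name is my own choice):

filter {
  uuid {
    # Generate a random UUID per event; @metadata keeps it out of the
    # document body.
    target => "[@metadata][uuid]"
  }
}

Then reference %{[@metadata][uuid]} as the document_id in the elasticsearch output, as in the earlier sketch.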

Might this still cause a duplicate if Beats encounters a problem shipping to Logstash?

A UUID will only help prevent duplicates created after it was assigned, or if it is taken from an external system. If you use Logstash to create the UUID, you could still end up with duplicates: when Beats is forced to retry, both copies of the event reach Logstash and each is assigned a different UUID. A hash computed from the event's content, by contrast, would be identical for both copies.

So given that the event fields differ between each Metricbeat metricset, how can I achieve this in a more elegant way than e.g.

filter {
  if [@metadata][beat] == "metricbeat" and [metricset][module] == "system" {
    if [metricset][name] == "process" {
      fingerprint {
        concatenate_sources => true
        target => "[@metadata][uuid]"
        method => "MURMUR3"
        source => [ "[every][single]", "[event][field]", "[for][process]", "[listed][individually]", "[in]", "[a][big]", "[long][array]" ]
      }
    }
    if [metricset][name] == "diskio" {
      fingerprint {
        concatenate_sources => true
        target => "[@metadata][uuid]"
        method => "MURMUR3"
        source => [ "[every][single]", "[event][field]", "[for][diskio]", "[listed][individually]", "[in]", "[a][big]", "[long][array]" ]
      }
    }
    if [metricset][name] == "network" {
      fingerprint {
        concatenate_sources => true
        target => "[@metadata][uuid]"
        method => "MURMUR3"
        source => [ "[every][single]", "[event][field]", "[for][network]", "[listed][individually]", "[in]", "[a][big]", "[long][array]" ]
      }
    }
  }
}

Is there a way I can say...

filter {
  fingerprint {
    concatenate_sources => true
    target => "[@metadata][uuid]"
    method => "MURMUR3"
    source => "[the][value][of][every][field][please]"
  }
}
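
For what it's worth, newer versions of the fingerprint filter have a concatenate_all_fields option that appears to do exactly this: it hashes the values of every field in the event, sorted by field name, instead of requiring an explicit source list. A sketch, assuming a plugin version that supports that option:

filter {
  fingerprint {
    # Hash the values of all event fields instead of listing each one.
    concatenate_all_fields => true
    target => "[@metadata][uuid]"
    method => "MURMUR3"
  }
}

One caveat: MURMUR3 is a 32-bit hash, so if you're using the fingerprint as a cluster-wide document ID, a longer method such as SHA256 may be a safer choice against collisions.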
