Duplicate logs in Kibana

Hello,
I have an issue: logs are being ingested more than once. The duplicates have different document IDs, but the log content is identical.

I use Fluentd to collect logs from the Kubernetes cluster with this config:

extraConfigMaps:
  containers.input.conf: |-
    <source>
      @id fluentd-containers.log
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/es-containers.log.pos
      tag raw.kubernetes.*
      read_from_head true
      <parse>
        @type multi_format
        <pattern>
          format json
          time_key time
          time_format %Y-%m-%dT%H:%M:%S.%NZ
        </pattern>
        <pattern>
          format /^(?<time>.+) (?<stream>stdout|stderr) [^ ]* (?<log>.*)$/
          time_format %Y-%m-%dT%H:%M:%S.%N%:z
        </pattern>
      </parse>
    </source>
    # Detect exceptions in the log output and forward them as one log entry.
    <match raw.kubernetes.**>
      @id raw.kubernetes
      @type detect_exceptions
      remove_tag_prefix raw
      message log
      stream stream
      multiline_flush_interval 5
      max_bytes 500000
      max_lines 1000
    </match>
  system.input.conf: |-
    <source>
      @id systemd.log
      @type systemd
      tag systemd
      read_from_head true
      <storage>
        @type local
        persistent true
        path /var/log/systemd.log.pos
      </storage>
      <entry>
        field_map {"MESSAGE": "log", "_PID": ["process", "pid"], "_CMDLINE": "process", "_COMM": "cmd"}
        field_map_strict false
        fields_strip_underscores true
        fields_lowercase true
      </entry>
    </source>
    # Example:
    # I1118 21:26:53.975789       6 proxier.go:1096] Port "nodePort for kube-system/default-http-backend:http" (:31429/tcp) was open before and is still needed
    <source>
      @id kube-proxy.log
      @type tail
      format multiline
      multiline_flush_interval 5s
      format_firstline /^\w\d{4}/
      format1 /^(?<severity>\w)(?<time>\d{4} [^\s]*)\s+(?<pid>\d+)\s+(?<source>[^ \]]+)\] (?<log>.*)/
      time_format %m%d %H:%M:%S.%N
      path /var/log/kube-proxy.log
      pos_file /var/log/es-kube-proxy.log.pos
      tag kubeproxy
      read_from_head true
    </source>
  monitoring.conf: |-
    <source>
      @type prometheus_output_monitor
    </source>
    <source>
      @type prometheus
      bind 0.0.0.0
      port 24231
      metrics_path /metrics
    </source>
  output.conf: |-
    <filter kubeproxy>
      @type record_transformer
      enable_ruby
      <record>
        hostname ${ENV["HOSTNAME"]}
      </record>
    </filter>
    
    <filter **>
      @type prometheus
      <metric>
        type counter
        name fluentd_input_status_num_records_total
        desc Total number of log entries generated by either application containers or system components
      </metric>
    </filter>
    <filter kubernetes.**>
      @type kubernetes_metadata
    </filter>
    <match **>
      @id elasticsearch
      @type elasticsearch
      @log_level info
      include_tag_key true
      host ${HOST}
      port 9200
      scheme https
      ssl_verify false
      ssl_version TLSv1_2
      user ${USERNAME}
      password ${PASSWORD}
      logstash_format true
      logstash_prefix ${INDEX_NAME}
      <buffer>
        @type file
        path /var/log/fluentd-buffers/kubernetes.system.buffer
        flush_mode interval
        retry_type exponential_backoff
        flush_thread_count 2
        flush_interval 5s
        retry_forever
        retry_max_interval 30
        chunk_limit_size 2M
        queue_limit_length 8
        overflow_action block
      </buffer>
    </match>

I've never worked with Fluentd, but it looks as if the same event is pulled or pushed multiple times. This can happen if you have multiple harvesters ingesting from the same source, or if the system sends the event more than once. Because a different UUID is generated for each copy of the same event, the duplicates end up as separate documents. My fingerprinting approach below always produces the same hash for the same content, derived from the source fields.

Quick workaround:
You could create a custom fingerprint in your ingest pipeline, based on one or more source fields (message, field.other, etc.), and use the "_id" field as the target. Identical events will then get the same hash value, computed from the combined content of the specified source fields, so a re-ingested event overwrites the existing document instead of creating a duplicate.
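
For illustration, a minimal sketch of such a pipeline using Elasticsearch's built-in fingerprint ingest processor (available in recent versions). The pipeline name is made up, and the field choices log, stream, and kubernetes.pod_name are assumptions based on your Fluentd parse and kubernetes_metadata output; adapt them to your documents:

PUT _ingest/pipeline/dedup_fingerprint
{
  "description": "Derive a deterministic _id from the event content",
  "processors": [
    {
      "fingerprint": {
        "fields": ["log", "stream", "kubernetes.pod_name"],
        "target_field": "event_fingerprint",
        "method": "SHA-256"
      }
    },
    {
      "set": {
        "field": "_id",
        "copy_from": "event_fingerprint"
      }
    },
    {
      "remove": {
        "field": "event_fingerprint"
      }
    }
  ]
}

You can attach the pipeline through the index's default_pipeline setting, or from Fluentd if your fluent-plugin-elasticsearch version supports the pipeline parameter in the <match> block.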

The disadvantage is that you lose the optimized UUID generation for the "_id" field, which can impact indexing performance and storage.
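
Alternatively, the same fingerprinting idea can be applied on the Fluentd side, before events are buffered, which also protects against duplicates created by retried bulk requests. A sketch, assuming a reasonably recent fluent-plugin-elasticsearch that ships the elasticsearch_genid filter (the record_keys selection is an assumption; check your plugin version's README for the exact options):

<filter kubernetes.**>
  @type elasticsearch_genid
  use_record_as_seed true        # hash the record content instead of generating a random id
  record_keys ["log", "stream"]  # fields that define event identity; adapt as needed
  hash_type sha256
  hash_id_key _hash              # field in which the generated id is stored
</filter>

<match **>
  @type elasticsearch
  # ... your existing output settings ...
  id_key _hash       # use the hash as the Elasticsearch document _id
  remove_keys _hash  # do not index the helper field itself
</match>

With id_key set, a retried chunk overwrites the earlier copy instead of creating a new document. Note that events whose chosen fields are truly identical will collapse into one document, which is exactly the behavior the fingerprint approach trades on.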


Thank you for the answer, I'll try to investigate.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.