I have an EFK cluster on version 6.8.2, and everything ran smoothly until I had a requirement to create a specific type of index (project-large_payload-$namespace) based on a field's length.
For this, I used a script processor in an ingest pipeline that checks the length of the field and updates the document's target index accordingly (i.e. sets ctx._index=project-large_payload-$namespace*).
The pipeline itself works fine, and I can see the new index being created with the logs whose field length exceeds 10 KB.
However, the same documents are also present in the default indices (project-$namespace-*), i.e. the docs were "copied" across rather than "moved".
Any idea how I can restrict Filebeat to push these docs only to the large_payload index?
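For reference, a pipeline along these lines can be sketched as below. The field name (`message`), the 10 KB threshold, and the `kubernetes.namespace` field are assumptions for illustration, not taken from the original post:

```
PUT _ingest/pipeline/route-large-payloads
{
  "description": "Sketch: reroute oversized docs to a large_payload index",
  "processors": [
    {
      "script": {
        "source": "if (ctx.message != null && ctx.message.length() > 10240) { ctx._index = 'project-large_payload-' + ctx['kubernetes']['namespace']; }"
      }
    }
  ]
}
```

Setting `ctx._index` inside the script processor changes where the document is written; it does not by itself emit a second copy, so duplicates usually point to the event also being indexed through another path.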
As an update, I resolved the issue by adding another processor to my pipeline, and I no longer see duplicates coming in. However, there is another source of log duplication in the environment.
When the Filebeat pods are restarted, they re-read all the log files in the configured path from the beginning (HEAD) and thus push duplicate logs to ES.
Imagine I have 100 pods running on my host machine for about two months and I restart Filebeat on that host: it will take at least one full day to first reprocess all the old logs (duplicates) and only then start tailing recent logs (already backlogged by a day at that point).
Is there any way to stop Filebeat from re-reading the log files on restart and thus avoid duplicate logging?
My filebeat-config.yml is:
Maybe this happens because the registry is removed along with the pod, so when the pod starts again it doesn't know its previous state. Maybe tail_files can help you here.
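If Filebeat runs as a DaemonSet, a common way to keep the registry across restarts is to persist Filebeat's data directory on the host via a hostPath volume, so a restarted pod resumes from the recorded file offsets instead of re-reading from HEAD. A sketch (paths follow Elastic's reference Kubernetes manifests; adjust to your deployment):

```
# Filebeat DaemonSet excerpt (sketch): keep the data directory,
# which contains the registry, on the host so restarts resume
# where the previous pod stopped.
spec:
  template:
    spec:
      containers:
        - name: filebeat
          volumeMounts:
            - name: data
              mountPath: /usr/share/filebeat/data
      volumes:
        - name: data
          hostPath:
            path: /var/lib/filebeat-data
            type: DirectoryOrCreate
```

Note that tail_files only affects how brand-new files are first read (from the end instead of the beginning); persisting the registry is what prevents re-shipping files Filebeat has already processed.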