Duplicate logs with ingest pipeline

I have an EFK cluster running version 6.8.2, and everything was working smoothly until I got a requirement to route documents to a specific index (project-large_payload-$namespace) based on a field's length.
For this, I used a script processor in an ingest pipeline which checks the length of the field and updates the document accordingly (i.e. sets ctx._index = project-large_payload-$namespace*).
The pipeline itself works fine: I can see the new index being created, and logs whose field length exceeds 10 KB are pushed into it.
However, the same documents are also present in the default indices (project-$namespace-*), i.e. the docs were "copied" across rather than "moved".

Any idea how I can restrict Filebeat to push these docs only to the large_payload index?

My script is:

    "processors" : [
      {
        "script" : {
          "source" : """
            if (ctx['@timestamp'] != null) {
              def year  = ctx['@timestamp'].substring(0, 4);
              def month = ctx['@timestamp'].substring(5, 7);
              def date  = ctx['@timestamp'].substring(8, 10);
              if (ctx.payload.length() > params.large_payload_size) {
                StringBuffer buff = new StringBuffer(
                  'project-'.concat(ctx.kubernetes.namespace.concat('-'))
                            .concat(ctx.beat.version).concat('-')
                            .concat(year).concat('.').concat(month).concat('.').concat(date));
                ctx._index = buff.insert(buff.indexOf('-') + 1, 'large_payload-').toString();
              }
            }
          """,
          "params" : {
            "large_payload_size" : 10240
          }
        }
      }
    ]
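For reference, the string manipulation in the script can be sanity-checked outside Elasticsearch. Here is a rough Python equivalent of the index-rewrite logic (the function name and sample values are illustrative, not part of the actual pipeline):

```python
# Hypothetical Python re-implementation of the Painless index-rewrite logic,
# useful only for reasoning about the string handling outside Elasticsearch.
LARGE_PAYLOAD_SIZE = 10240  # mirrors params.large_payload_size


def target_index(timestamp: str, namespace: str, beat_version: str, payload: str) -> str:
    # Same substring offsets the script uses on @timestamp ("2019-08-21T...")
    year, month, day = timestamp[0:4], timestamp[5:7], timestamp[8:10]
    index = f"project-{namespace}-{beat_version}-{year}.{month}.{day}"
    if len(payload) > LARGE_PAYLOAD_SIZE:
        # Insert 'large_payload-' right after the first '-', as the script does
        pos = index.index("-") + 1
        index = index[:pos] + "large_payload-" + index[pos:]
    return index
```

So a document with a payload over 10240 characters goes to `project-large_payload-$namespace-$version-$date`, and anything smaller keeps the default `project-$namespace-$version-$date` name.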

As an update, I have resolved this issue by adding an additional processor to my pipeline, and I no longer see those duplicates. However, there is another source of duplicate logs in the environment.
When the Filebeat pods are restarted, they re-read all the log files from the configured path from the beginning (HEAD) and thus push duplicate logs to ES.
Imagine I have 100 pods running on a host for about 2 months and I restart Filebeat on that host: it will take at least one full day to re-process all the old logs (duplicates) before it gets back to tailing the recent logs (which are already backlogged by a day at that point).

Is there any way to stop Filebeat from re-reading the log files on restart, and thus avoid the duplicate logging?
My filebeat-config.yml is:

    filebeat.config:
      inputs:
        # Mounted `filebeat-inputs` configmap:
        path: /usr/share/filebeat/inputs.d/*.yml
        # Reload inputs configs as they change:
        reload.enabled: false
      modules:
        path: /etc/filebeat/modules.d/**/*.yml
        # Reload module configs as they change:
        reload.enabled: false

    # To enable hints based autodiscover, remove `filebeat.config.inputs` configuration and uncomment this:
    #filebeat.autodiscover:
    #  providers:
    #    - type: kubernetes
    #      hints.enabled: true

    processors:
      - add_cloud_metadata: ~

    cloud.id: ${ELASTIC_CLOUD_ID}
    cloud.auth: ${ELASTIC_CLOUD_AUTH}

    output.elasticsearch:
      protocol: https
      hosts: ['${ELASTICSEARCH_HOST:elasticsearch}:${ELASTICSEARCH_PORT:9200}']
      ssl:
        enabled: true
        client_authentication: required
        certificate_authorities: ["/etc/pki/client/client-ca.crt"]
        certificate: /etc/pki/client/client.crt
        key: /etc/pki/client/client.key
      index: "%{[index.name]:project}-%{[kubernetes.namespace]:filebeat}-%{[beat.version]}-%{+yyyy.MM.dd}"
      indices:
        - index: "%{[kubernetes.labels.type]}-%{[kubernetes.namespace]}-%{[beat.version]}-%{+yyyy.MM.dd}"
          when.equals:
            kubernetes.labels.type: "mq"
        - index: "%{[kubernetes.labels.app.value]:storageos}-%{[kubernetes.namespace]}-%{[beat.version]}-%{+yyyy.MM.dd}"
          when.equals:
            kubernetes.labels.storageos_cr: "storageos"
      pipelines:
        - pipeline: "large_payload"
          when.has_fields: ["payload", "@timestamp"]

and my filebeat-inputs.yml is:

    - type: docker
      combine_partial: true
      json.keys_under_root: true
      json.add_error_key: true
      json.ignore_decoding_error: true
      scan_frequency: 1s
      close_inactive: 10m
      containers:
        path: "/var/lib/docker/containers"
        stream: "all"
        ids:
          - "*"
      processors:
        - add_kubernetes_metadata:
            in_cluster: true
            matchers:
              - logs_path:
                  logs_path: "/var/log/containers"
        - drop_event:
            when:
              or:
                - contains:
                    kubernetes.pod.name: "build"
                - contains:
                    kubernetes.pod.name: "deploy"


Maybe this happens because the registry file is removed along with the pod, so when the pod starts again it doesn't know the previous state. Maybe tail_files can help you here.
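Building on that: the registry only survives restarts if Filebeat's data path is persisted outside the pod. A common approach in Kubernetes deployments is to mount the data directory from the host, along the lines of this DaemonSet fragment (the hostPath location is illustrative):

```yaml
# DaemonSet spec fragment (sketch): persist Filebeat's registry across pod restarts
# by backing /usr/share/filebeat/data with a hostPath volume.
volumeMounts:
  - name: data
    mountPath: /usr/share/filebeat/data
volumes:
  - name: data
    hostPath:
      # Illustrative host location; data persists when the pod is recreated,
      # so Filebeat resumes from the stored offsets instead of re-reading from HEAD.
      path: /var/lib/filebeat-data
      type: DirectoryOrCreate
```

With the registry persisted, Filebeat resumes tailing from the recorded offsets instead of re-reading every file, which avoids the restart duplicates described above.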


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.