Duplicate Entries in Filebeat Registry File After Kubernetes Pod Restart

Hi,

I am running Filebeat in a Docker container on a Kubernetes cluster to process our application logs and send them to Logstash. Our log data is stored on a PV, so I run a single Filebeat pod that reads the logs from that PV, processes them, and sends them to Logstash. I store the Filebeat registry file on our PV so that after any restart Filebeat resumes from where it left off.

The problem arises on pod restarts. Because the new pod may be scheduled on any node in the Kubernetes cluster, the device value recorded in the Filebeat registry file changes. Filebeat then cannot retrieve the offset for a log file under the new (device + inode) combination, so it creates a new registry entry for that log file with the new combination.
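The lookup miss described above can be illustrated with a small model. This is only a sketch in Python, assuming the registry behaves like a map keyed by the (inode, device) pair; lookup_offset is a hypothetical helper and not Filebeat's actual code. The numbers are taken from the registry entries shown later in this post.

```python
from typing import Dict, Tuple

# Simplified model: registry state keyed by (inode, device).
# This is an illustration only, not how Filebeat is implemented.
Registry = Dict[Tuple[int, int], dict]


def lookup_offset(registry: Registry, inode: int, device: int) -> int:
    """Return the stored offset for (inode, device), or 0 if the pair is unknown."""
    state = registry.get((inode, device))
    return state["offset"] if state else 0


registry: Registry = {
    (4187121761, 2097258): {
        "source": "my_service_.2020-06-16.27.log",
        "offset": 158130,
    }
}

# Same node: the device matches, so the stored offset is found.
same_node = lookup_offset(registry, 4187121761, 2097258)

# New node: same inode but a different device, so the lookup misses
# and the file would be read again from the beginning.
new_node = lookup_offset(registry, 4187121761, 8388788)
```

In this model the second lookup returns 0 even though the file itself has not changed, which matches the behaviour seen after the pod moves to another node.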

Registry (data.json) file before pod restart:

{
    "source": "my_service_.2020-06-16.27.log",
    "offset": 158130,
    "timestamp": "2020-07-24T18:43:28.716910538Z",
    "ttl": -1,
    "type": "log",
    "meta": null,
    "FileStateOS": {
        "inode": 4187121761,
        "device": 2097258
    }
}

Registry (data.json) file after pod restart:

{
    "source": "my_service_.2020-06-16.27.log",
    "offset": 158130,
    "timestamp": "2020-07-24T18:43:28.716910538Z",
    "ttl": -1,
    "type": "log",
    "meta": null,
    "FileStateOS": {
        "inode": 4187121761,
        "device": 2097258
    }
},
{
    "source": "my_service_.2020-06-16.27.log",
    "offset": 210589,
    "timestamp": "2020-07-24T18:53:59.49175077Z",
    "ttl": -1,
    "type": "log",
    "meta": null,
    "FileStateOS": {
        "inode": 4187121761,
        "device": 8388788
    }
}

This duplicate entry in the registry file causes Filebeat to reprocess that log file, leading to duplicate logs in Elasticsearch.
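To check how many files are affected, the registry can be scanned for sources that appear under more than one (inode, device) pair. The following is a minimal sketch in Python, assuming the registry content is a JSON array of entries shaped like the ones above; find_duplicate_sources is a hypothetical helper name, and the sample is trimmed to the fields the check needs.

```python
import json
from collections import defaultdict


def find_duplicate_sources(entries):
    """Map each source that has more than one registry entry to its (inode, device) pairs."""
    by_source = defaultdict(list)
    for entry in entries:
        os_state = entry["FileStateOS"]
        by_source[entry["source"]].append((os_state["inode"], os_state["device"]))
    return {src: ids for src, ids in by_source.items() if len(ids) > 1}


# Sample built from the two entries shown above (other fields omitted).
sample = json.loads("""
[
  {"source": "my_service_.2020-06-16.27.log", "offset": 158130,
   "FileStateOS": {"inode": 4187121761, "device": 2097258}},
  {"source": "my_service_.2020-06-16.27.log", "offset": 210589,
   "FileStateOS": {"inode": 4187121761, "device": 8388788}}
]
""")

duplicates = find_duplicate_sources(sample)
for source, ids in duplicates.items():
    print(source, ids)
```

For a real registry, the sample would be replaced by the parsed content of data.json, assuming it has the same array layout.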

Because our log data is very large, we cannot afford to reprocess all the log files on every pod restart.

  • Is there any solution to this problem?

  • Is there any way to keep the device field the same across pod restarts?

Please provide a solution as soon as possible, as this is a crucial part of our deployment.

Take a look at file_identity described here: https://www.elastic.co/guide/en/beats/filebeat/master/filebeat-input-log.html

Hi Marcin,

Thanks for the link.

I ran "lsblk -o MOUNTPOINT,UUID" in the Kubernetes cluster. The output was a bunch of empty lines, so I could not proceed with inode_marker as file_identity. Is there any other solution for this?

Meanwhile, I tried running Filebeat locally with path as file_identity. After Filebeat had processed the log file completely, I renamed the log file. By that logic, Filebeat should have started harvesting the file again (the path changed, and path is the file_identity), but harvesting did not start. I checked the registry file, and it is still using the inode and device in FileStateOS.
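The expectation behind this test can be written down as a small model. This is only a sketch in Python of the two identity strategies as I understand them, not Filebeat's implementation; path_identity, native_identity and needs_reharvest are hypothetical names, and the file paths are made up for illustration.

```python
def path_identity(path, inode, device):
    """Identify a file by its path only (the behaviour expected from file_identity.path)."""
    return ("path", path)


def native_identity(path, inode, device):
    """Identify a file by inode and device (the behaviour actually observed)."""
    return ("native", inode, device)


def needs_reharvest(identity, before, after):
    """True if the file maps to a different registry key after the change."""
    return identity(*before) != identity(*after)


# Each tuple is (path, inode, device); the paths are hypothetical.
original = ("../Logs/app/a.log", 4180331148, 5242882)
renamed = ("../Logs/app/b.log", 4180331148, 5242882)      # local rename test
rescheduled = ("../Logs/app/a.log", 4180331148, 3145768)  # pod moved to another node

# Rename: a path-based identity changes, an inode/device identity does not.
rename_with_path = needs_reharvest(path_identity, original, renamed)
rename_with_native = needs_reharvest(native_identity, original, renamed)

# Pod restart on another node: the opposite outcome.
restart_with_path = needs_reharvest(path_identity, original, rescheduled)
restart_with_native = needs_reharvest(native_identity, original, rescheduled)
```

In this model, a renamed file that is not harvested again is what an inode/device identity would produce, which suggests the path setting was not being applied in the local test.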

My filebeat.yml looks like this:

filebeat.inputs:
  - type: log
    file_identity.path: ~
    recursive_glob.enabled: true
    paths:
      - ../Logs/**/*.log*

So why is path as file_identity not working?

I also used path as file_identity in Kubernetes.

I added file_identity.path: ~ to the Filebeat log input and deployed it to Kubernetes. Once all the files had been processed, I restarted the pod. Filebeat then started processing all the files again. This time, however, the registry file after the restart does not contain entries with the old device ID; it contains only entries with the new device ID.

Registry (data.json) file before pod restart:

{
    "source": "my_service_.2020-05-08.12.log",
    "offset": 52433912,
    "timestamp": "2020-07-28T17:24:16.527054115Z",
    "ttl": -1,
    "type": "log",
    "meta": null,
    "FileStateOS": {
        "inode": 4180331148,
        "device": 5242882
    }
}

Registry (data.json) file after pod restart:

{
    "source": "my_service_.2020-05-08.12.log",
    "offset": 94627,
    "timestamp": "2020-07-28T13:42:50.376340649Z",
    "ttl": -1,
    "type": "log",
    "meta": null,
    "FileStateOS": {
        "inode": 4180331148,
        "device": 3145768
    }
}

All files were still reprocessed, which led to duplicate records in ES.

So why is path as file_identity not working?

file_identity is going to be released in 7.9. What version are you running?

I am using Filebeat 7.8.1.

I will wait for Filebeat 7.9.
