Hi,
I am running Filebeat in a Docker container on a Kubernetes cluster to process our application logs and send them to Logstash. The log data is stored on a PersistentVolume (PV), so I run only one Filebeat pod, which reads the logs from that PV, processes them, and ships them to Logstash. I also store the Filebeat registry file on the PV so that after any restart, Filebeat resumes from wherever it left off.
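For reference, the relevant parts of the setup look roughly like this (paths, volume names, and the PVC name below are illustrative, not our exact manifest):

# filebeat.yml (simplified)
filebeat.registry.path: /data/registry    # registry kept on the PV
filebeat.inputs:
  - type: log
    paths:
      - /data/logs/*.log                  # application logs on the PV
output.logstash:
  hosts: ["logstash:5044"]

# Relevant part of the Filebeat pod spec
volumeMounts:
  - name: app-data
    mountPath: /data
volumes:
  - name: app-data
    persistentVolumeClaim:
      claimName: app-logs-pvc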
The problem arises with pod restarts. Since the new pod may be scheduled on any node in the cluster, the device field in the Filebeat registry file changes. Filebeat can then no longer find the stored offset for a log file under the new (device + inode) combination, so it creates a new entry for that log file in the registry with the new (device + inode) combination.
Registry (data.json) file before the pod restart:
{
  "source": "my_service_.2020-06-16.27.log",
  "offset": 158130,
  "timestamp": "2020-07-24T18:43:28.716910538Z",
  "ttl": -1,
  "type": "log",
  "meta": null,
  "FileStateOS": {
    "inode": 4187121761,
    "device": 2097258
  }
}
Registry (data.json) file after the pod restart:
{
  "source": "my_service_.2020-06-16.27.log",
  "offset": 158130,
  "timestamp": "2020-07-24T18:43:28.716910538Z",
  "ttl": -1,
  "type": "log",
  "meta": null,
  "FileStateOS": {
    "inode": 4187121761,
    "device": 2097258
  }
},
{
  "source": "my_service_.2020-06-16.27.log",
  "offset": 210589,
  "timestamp": "2020-07-24T18:53:59.49175077Z",
  "ttl": -1,
  "type": "log",
  "meta": null,
  "FileStateOS": {
    "inode": 4187121761,
    "device": 8388788
  }
}
This duplicate entry in the registry file causes Filebeat to reprocess that log file, leading to duplicate logs in Elasticsearch.
Because our log data is very large, we cannot afford to reprocess all the log files on every pod restart.
- Is there any solution to this problem?
- Is there any way the device field can be kept the same across pod restarts?
Please suggest a solution as soon as possible, as this is a crucial part of our deployment.