Logstash fingerprint not able to remove duplicates

Hi,

We have been using the Logstash fingerprint plugin for quite some time. We now have a requirement to create a fingerprint hash key from multiple fields of our Elasticsearch indices, but we are stuck in two cases, detailed below.

If concatenate_sources => false:
Even if we provide multiple fields, only the last field seems to be used to generate the hash key.

If concatenate_sources => true:
When using multiple fields, duplicate data is eliminated as long as it is all consumed from a single file in S3. If the same duplicate data is consumed from multiple S3 files, the duplicates are not eliminated.

Please suggest how to eliminate duplicates while reading data from several files in an S3 bucket.
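For reference, our understanding of what the plugin computes with our settings (a rough Python sketch of the HMAC-SHA1 + base64 combination, with sample field values; this is an illustration, not the actual plugin code):

```python
import base64
import hashlib
import hmac

# Sample values from one of our events; "Log analytics" is the key from our config
key = b"Log analytics"
fields = {
    "deviceinfo[di]": "d363022f-a05f-be2b-0d11-79f783d308df",
    "@timestamp": "2019-08-25T22:36:02.000Z",
    "eventlist[name]": "E01",
}

def hmac_sha1_b64(msg: str) -> str:
    """HMAC-SHA1 of msg under key, base64-encoded (mirrors method/key/base64encode)."""
    return base64.b64encode(hmac.new(key, msg.encode(), hashlib.sha1).digest()).decode()

# concatenate_sources => true: one hash over all fields joined into a single
# string, so identical field values yield the identical fingerprint every
# time, regardless of which S3 file the event came from
joined = "".join(f"|{k}|{v}" for k, v in fields.items())
fingerprint_true = hmac_sha1_b64(joined)

# concatenate_sources => false: one hash per field, each written to the same
# target, so only the last field's hash survives
fingerprint_false = None
for value in fields.values():
    fingerprint_false = hmac_sha1_b64(value)

print(fingerprint_true)
print(fingerprint_false)
```

This is why we expected identical events from different files to collide on the same fingerprint.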

Which fields are you specifying for the hash?

Hello Christian,

Please find below our sample data and our filter, including the fingerprint config.

{"logtype": "logs","servicename": "Ref","deviceinfo": {"productid": "ARTIK051","policyversion": 1541130000,"devicetype": "Ref","stage": "PROD","platform": "RT 1.0","di": "d363022f-a05f-be2b-0d11-79f783d308df","uid": "e8rnuugabc","serialnumber": "343333443333335","macaddress": "999999999999","mnmo": "ARTIK051|00108441|000008500012115A0100000000000000"},"ipaddress":"", "versioninfo": {"wifi": "991120190702","firmware": "99121706, FFFFFF"},"eventlist": [{"name": "E01","dt": "2019-08-25 22:36:02"}]}

filter {

  split {
    field => "eventlist"
  }
  fingerprint {
    target => "[@metadata][fingerprint]"
    source => ["deviceinfo[di]", "@timestamp", "eventlist[name]"]
    #source => ["eventlist[name]"]
    method => "SHA1"
    key => "Log analytics"
    base64encode => true
    concatenate_sources => true
  }
  json {
    source => "message"
    target => ""
  }
  date {
    match => ["eventlist[dt]", "YYYY-MM-dd HH:mm:ss"]
  }
  mutate {
    remove_field => ["eventlist[logencoding]", "eventlist[dt]", "@version"]
  }

}

When the duplicate entries of the sample data above are read from the same file, the duplicates are removed, but when they are read from different files, the duplicates remain. We need to read data from different files and still eliminate duplicates.

Thanks

Are all the fields that make up the hash identical? Can you show an example of two duplicate events that have been indexed?

How come you are running the json filter after fingerprint? Do the fields even exist when the fingerprint filter is run?

Yes, those are identical. Below is an example of the duplicate data:

| Time | @timestamp | deviceinfo.di | eventlist.name | _id |
|---|---|---|---|---|
| Aug 26, 2019 @ 04:06:02.000 | Aug 26, 2019 @ 04:06:02.000 | d363022f-a05f-be2b-0d11-79f783d308df | E01 | 8bl+kCq78C2adJis6lIAaLFimPo= |
| Aug 26, 2019 @ 04:06:02.000 | Aug 26, 2019 @ 04:06:02.000 | d363022f-a05f-be2b-0d11-79f783d308df | E01 | X3rut13vaNd3fd2NzjNLfX+V+pg= |

Please show full indexed events from Elasticsearch in JSON. Make sure the document id is shown.

What does the JSON filter parse out from the message field?

{
  "_index": "logs-2019.08.25",
  "_type": "_doc",
  "_id": "8bl+kCq78C2adJis6lIAaLFimPo=",
  "_version": 1,
  "_score": null,
  "_source": {
    "servicename": "Ref",
    "versioninfo": {
      "wifi": "991120190702",
      "firmware": "99121706, FFFFFF"
    },
    "logtype": "logs",
    "@timestamp": "2019-08-25T22:36:02.000Z",
    "ipaddress": "",
    "deviceinfo": {
      "macaddress": "999999999999",
      "di": "d363022f-a05f-be2b-0d11-79f783d308df",
      "stage": "PROD",
      "productid": "ARTIK051",
      "platform": "RT 1.0",
      "uid": "e8rnuugabc",
      "serialnumber": "343333443333335",
      "devicetype": "Ref",
      "mnmo": "ARTIK051_REF_17K|00108441|000008500012115A0100000000000000",
      "policyversion": 1541130000
    },
    "eventlist": {
      "name": "ES01"
    }
  },
  "fields": {
    "@timestamp": [
      "2019-08-25T22:36:02.000Z"
    ]
  },
  "sort": [
    1566772562000
  ]
}

{
  "_index": "logs-2019.08.25",
  "_type": "_doc",
  "_id": "X3rut13vaNd3fd2NzjNLfX+V+pg=",
  "_version": 1,
  "_score": null,
  "_source": {
    "servicename": "Ref",
    "versioninfo": {
      "wifi": "991120190702",
      "firmware": "99121706, FFFFFF"
    },
    "logtype": "logs",
    "@timestamp": "2019-08-25T22:36:02.000Z",
    "ipaddress": "",
    "deviceinfo": {
      "macaddress": "999999999999",
      "di": "d363022f-a05f-be2b-0d11-79f783d308df",
      "stage": "PROD",
      "productid": "ARTIK051",
      "platform": "RT 1.0",
      "uid": "e8rnuugabc",
      "serialnumber": "343333443333335",
      "devicetype": "Ref",
      "mnmo": "ARTIK051_REF_17K|00108441|000008500012115A0100000000000000",
      "policyversion": 1541130000
    },
    "eventlist": {
      "name": "ES01"
    }
  },
  "fields": {
    "@timestamp": [
      "2019-08-25T22:36:02.000Z"
    ]
  },
  "sort": [
    1566772562000
  ]
}

What does the JSON filter do? Are the fields available when fingerprint is run?
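If the json filter is what parses those fields out of `message`, they do not exist yet at the point the fingerprint filter runs, so the hash is computed over missing values and can vary for reasons unrelated to the event content. A sketch of a reordered pipeline (an illustration under that assumption, using the bracketed field-reference syntax `[deviceinfo][di]`, not a tested config) might look like:

```
filter {
  json {
    source => "message"
  }
  split {
    field => "eventlist"
  }
  date {
    match => ["[eventlist][dt]", "YYYY-MM-dd HH:mm:ss"]
  }
  fingerprint {
    target => "[@metadata][fingerprint]"
    source => ["[deviceinfo][di]", "@timestamp", "[eventlist][name]"]
    method => "SHA1"
    key => "Log analytics"
    base64encode => true
    concatenate_sources => true
  }
}
```

With the fingerprint stored in `[@metadata][fingerprint]`, using it as the document id in the elasticsearch output (`document_id => "%{[@metadata][fingerprint]}"`) would then deduplicate across files, since identical field values always produce the same hash.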

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.