Problem using fingerprint for my documents stored in Elasticsearch

Hi,
I want to provide a unique id for the data I store in an Elasticsearch index using the fingerprint filter. But when I feed a bulk of docs to Logstash, only one of the documents is stored in the index. I have 9 docs (9 JSON dictionaries) stored in one file (myfile.json), and this file is the input to Logstash. Here is my Logstash config file:

input {
    file {
        start_position => "beginning"
        path => "/my_repo/myfile.json"
        sincedb_path => "/dev/null"     
    }
}
filter {
    json {
        source => "message"
    }
}
. . .
filter {
    date {
        match => ["start_time", "UNIX_MS"]
        target => "@start_time"
    }
}
. . .
filter {
    fingerprint {
        concatenate_sources => true
        source => ["id", "start_time"]
        target => "[@metadata][fingerprint]"
        method => "MURMUR3"
    }
}
output {
    elasticsearch {
        hosts => "myhost:port"
        user => "myname"
        password => "***"
        index => "myindex"
        document_id => "[@metadata][fingerprint]"
    }
}

attribute "id" is the same in all documents but "start_time" is different in each document and I want to use a combination of both the provide a unique for each of the documents. When I feed the data to my index I see only one document and only the version of that document increases by 9 each time I do a new write (I use the same file with the same document to test how it works). I think there is something wrong in my fingerprint but I cannot understand what? Can someone please help?

Thanks.

Can you share a sample of the documents with the fields id and start_time as they appear in the JSON?

With this it is possible to try to simulate the pipeline and see what is wrong.

Hi leandrojmp,

Thanks for your answer. Here is the sample of my documents:

{"owner": "me", "last_update": 1618568744821, "failed_runs": 20,  "total_runs_in_session": 115, "end_time": 1618568744000, "other_runs": 0, "passed_runs": 95, "session_status": "completed", "name": "m2e_sanity_lll_tests.eaidnrm.21_04_16_10_56_44_2660", "start_time": 1618563404000, "proj_name": "llll", "id": -885333331, "block_name": "m2e"}
{"owner": "eaidnrm", "last_update": 1619692604785, "failed_runs": 14,  "total_runs_in_session": 115, "end_time": 1619692605000, "other_runs": 0, "passed_runs": 101, "session_status": "completed", "name": "m2e_sanity_lll_tests.eaidnrm.21_04_21_10_54_28_3013", "start_time": 1618995268000, "proj_name": "lll", "id": 1610750080, "block_name": "m2e"}

The fact is that I have a lot of documents and I just picked 9 of them for testing. The id and start_time may or may not be the same across documents. Thanks.

I could not replicate your issue. Using those two events, I got two different fingerprints, which would result in two different documents.

Can you share the 9 documents you are using for testing that is giving the same fingerprint?

Also, to create the fingerprint you should use fields that are present in every document. If you have documents that do not have the fields id and start_time, each of those documents will have the same fingerprint and they will end up as the same document in Elasticsearch.
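
For reference, something like this minimal pipeline can be used to test it (the stdin input and rubydebug output are just my choices for a quick local test, and the fingerprint goes in a plain field so it appears in the output):

input {
    stdin {}
}
filter {
    json {
        source => "message"
    }
    fingerprint {
        concatenate_sources => true
        source => ["id", "start_time"]
        target => "fingerprint"
        method => "MURMUR3"
    }
}
output {
    stdout {
        codec => rubydebug
    }
}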

I am sorry! I thought you just needed an example. Here are my nine documents.

{"owner": "me", "last_update": 1618568744821, "failed_runs": 20, "total_runs_in_session": 115, "end_time": 1618568744000, "other_runs": 0, "passed_runs": 95, "session_status": "completed", "name": "m2e_sanity_lll_tests.me.21_04_16_10_56_44_2660", "start_time": 1618563404000, "proj_name": "lll", "id": -885333331, "block_name": "m2e"}
{"owner": "me" , "last_update": 1619692604785, "failed_runs": 14,  "total_runs_in_session": 115, "end_time": 1619692605000, "other_runs": 0, "passed_runs": 101, "session_status": "completed", "name": "m2e_sanity_lll_tests.me.21_04_21_10_54_28_3013", "start_time": 1618995268000, "proj_name": "lll", "id": 1610750080, "block_name": "m2e"}
{"owner": "me", "last_update": 1619693013587, "failed_runs": 26, "total_runs_in_session": 115, "end_time": 1619693013000, "other_runs": 0, "passed_runs": 89, "session_status": "completed", "name": "m2e_sanity_lll_tests.me.21_04_29_11_30_36_9180", "start_time": 1619688637000, "proj_name": "lll", "id": 778680953, "block_name": "m2e"}
{"owner": "me", "last_update": 1619769156159, "failed_runs": 0, "total_runs_in_session": 115, "end_time": 1619769156000, "other_runs": 0, "passed_runs": 115, "session_status": "completed", "name": "m2e_sanity_lll_tests.me.21_04_29_12_30_24_9666", "start_time": 1619692225000, "proj_name": "lll", "id": -36993562, "block_name": "m2e"}
{"owner": "me",  "last_update": 1621385045688, "failed_runs": 0, "total_runs_in_session": 3203, "end_time": 1621385045000, "other_runs": 0, "passed_runs": 3203, "session_status": "completed", "name": "fe_ipol2_lll.21_05_18_14_31_31_3373", "start_time": 1621341091000, "proj_name": "lll", "id": 1620965776, "block_name": "fe_ipol2"}
{"owner": "me",  "last_update": 1623140819306, "failed_runs": 1, "total_runs_in_session": 1, "end_time": 1623140819000, "other_runs": 0, "passed_runs": 0, "session_status": "completed", "name": "qqq.ppp.21_06_08_10_23_58_6841", "start_time": 1623140638000, "proj_name": "lll", "id": -1761078124, "block_name": "me_core"}
{"owner": "me",  "last_update": 1623145195588, "failed_runs": 1,  "total_runs_in_session": 1, "end_time": 1623145195000, "other_runs": 0, "passed_runs": 0, "session_status": "completed", "name": "qqq.ppp.21_06_08_11_39_03_8326", "start_time": 1623145144000, "proj_name": "lll", "id": -73975944, "block_name": "me_core"}
{"owner": "me",  "last_update": 1623150868966, "failed_runs": 1, "total_runs_in_session": 1, "end_time": 1623150868000, "other_runs": 0, "passed_runs": 0, "session_status": "completed", "name": "qqq.ppp.21_06_08_13_01_06_5601", "start_time": 1623150066000, "proj_name": "lll", "id": -1860702575, "block_name": "me_core"}
{"owner": "me", "last_date": 1623234844584, "failed_runs": 725, "total_runs_in_session": 1404, "end_time": 1623234844000, "other_runs": 0, "passed_runs": 679, "session_status": "completed", "name":"m2e_DPD_SC_Fun_Cov_tests_1h.me.21_06_09_09_26_45_2269", "start_time": 1623223605000, "proj_name": "lll", "id": -1847479680, "block_name": "m2e"}

I have the fields "id" and "start_time" in all my documents. Please disregard what I said in my previous post about id and start_time. The only thing I should mention is that start_time may be the same in some documents (not in these 9 documents, but in my real application).
Thanks

Could someone please help with this problem? I really need help.
Thanks

I also could not replicate your issue using those sample documents you shared.

This is what I got:

{
     "start_time" => 1618563404000,
             "id" => -885333331,
    "fingerprint" => 2690292881
}
{
     "start_time" => 1618995268000,
             "id" => 1610750080,
    "fingerprint" => 3895322953
}
{
     "start_time" => 1619688637000,
             "id" => 778680953,
    "fingerprint" => 1264227221
}
{
     "start_time" => 1619692225000,
             "id" => -36993562,
    "fingerprint" => 2267954862
}
{
     "start_time" => 1621341091000,
             "id" => 1620965776,
    "fingerprint" => 1535174955
}
{
     "start_time" => 1623140638000,
             "id" => -1761078124,
    "fingerprint" => 3623036441
}
{
     "start_time" => 1623145144000,
             "id" => -73975944,
    "fingerprint" => 3688766946
}
{
     "start_time" => 1623150066000,
             "id" => -1860702575,
    "fingerprint" => 1893627674
}
{
     "start_time" => 1623223605000,
             "id" => -1847479680,
    "fingerprint" => 2935393096
}

As you can see, every fingerprint is different, which would result in different documents. It works as expected for me.

Can you share the complete indexed document as well? Is the document correctly parsed?

Here is an image of my indexed data. To parse the data I used the logstash -f command in the terminal and passed my config file (whose content I posted above). This is the command I use:
./bin/logstash -f ./config/myconfig.conf


May I ask if you have used the same filters as mine? Probably I am doing something wrong; I am just not sure what it is.
I use Logstash 7.10.0, and I just realized that when I try to index my documents I see this line in the logs generated by Logstash:
logstash-7.10.0/vendor/bundle/jruby/2.5.0/gems/logstash-filter-fingerprint-3.2.2/lib/logstash/filters/fingerprint.rb:195: warning: constant ::Fixnum is deprecated
I am not sure whether this message is related to my problem.
Thanks.

Oh, I'm sorry, the error was pretty clear from your first post; for some reason I didn't catch it right away.

This is the issue:

document_id => "[@metadata][fingerprint]" 

It should be

document_id => "%{[@metadata][fingerprint]}" 

Your fingerprint is correct; the reference to the field is wrong.

The correct way to reference the value of a field is to use the sprintf format, which is %{[fieldName]}.

Since you had document_id set to [@metadata][fingerprint], the literal string [@metadata][fingerprint] was being used as the id for your document; this is pretty clear in the screenshot you shared.
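
So your output block would become (same host and credential placeholders as in your config):

output {
    elasticsearch {
        hosts => "myhost:port"
        user => "myname"
        password => "***"
        index => "myindex"
        document_id => "%{[@metadata][fingerprint]}"
    }
}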

And why are you even putting it in the metadata field...

filter {
    fingerprint {
        concatenate_sources => true
        source => ["id", "start_time"]
        target => "fingerprint"
        method => "MURMUR3"
    }
}
...

output {
    elasticsearch {
        hosts => "myhost:port"
        user => "myname"
        password => "***"
        index => "myindex"
        document_type => "_doc"
        document_id => "%{fingerprint}"
        action => "update"        # if you will be updating records
        doc_as_upsert => true     # if you will be inserting and/or updating records
    }
}
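
With action => "update" and doc_as_upsert => true, an event either updates the existing document with that _id or creates it if it does not exist yet, so re-running the same file will not create duplicates.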

Why would you not do that? Why index the [fingerprint] field when it is already available on the document as _id? It is redundant.

@Badger Good Point

I was not suggesting whether or not to index the fingerprint as a field... so sure, put it in a metadata field... my point was that the OP was having issues using the syntax correctly and debugging the issue, so a simple field was perhaps easier to understand than putting it in the metadata field.

I suppose the metadata field would be more efficient, but for debugging purposes it may have been easier to use a simple field that was also indexed... at least that was my thinking.

Whatever makes the most sense / is easiest...

The id is unique in each of my documents, but it is an attribute of my data and not the Elasticsearch _id. I want to generate the document _id using this id, and that is why I use fingerprint. Is there any other way to do that? Using fingerprints was the way I found to prevent duplicate data in my Logstash documents.

The fingerprint filter is the correct way to do that. The point Stephen raised is that using the [@metadata][fingerprint] field can interfere with troubleshooting and debugging, since the [@metadata] field is not present in the output.

But as we can see, the id of your document was the literal value [@metadata][fingerprint]. This happened because the reference in the document_id option was wrong; it should use the sprintf format, as shown in my last post.
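
As a side note for debugging: [@metadata] fields are hidden from outputs by default, but a stdout output with the rubydebug codec can be told to print them (a minimal sketch):

output {
    stdout {
        codec => rubydebug {
            metadata => true    # print [@metadata] fields as well
        }
    }
}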

@leandrojmp Thank you so much for helping me with this issue. Unfortunately I have had network problems all day today on my side and could not test your final suggestion; that is why I did not respond to it.

No problem, it should work when you test it, as using %{[@metadata][fingerprint]} will use the value of the field.

For example, if the value of the field [@metadata][fingerprint] is 712368127, that value will be used as the _id of the Elasticsearch document.

I just wanted to thank @stephenb for the helpful support. For others who may face the same problem: @stephenb's last suggestion solved my problem. Now my fingerprint works properly.
