Problem using fingerprint for my documents stored in Elasticsearch

Hi,
I want to provide a unique id for the data I store in an Elasticsearch index using the fingerprint filter. But when I feed a batch of docs to Logstash, only one of the documents is stored in the index. I have 9 docs (9 JSON dictionaries) stored in one file (myfile.json), and this file is the input to Logstash. Here is my Logstash config file:

input {
    file {
        start_position => "beginning"
        path => "/my_repo/myfile.json"
        sincedb_path => "/dev/null"     
    }
}
filter {
    json{
        source => "message"
    }
}
. . .
filter {
    date {
        match => ["start_time", 'UNIX_MS']
        target => "@start_time"
    }
}
. . .
filter {
    fingerprint {
        concatenate_sources => true
        source => ["id", "start_time"]
        target => "[@metadata][fingerprint]"
        method => "MURMUR3"
    }
}
output {
    elasticsearch{
        hosts => "muhost:port"
        user => "myname"
        password => "***"
        index => "myindex"
        document_id => "[@metadata][fingerprint]"        
    }
}

attribute "id" is the same in all documents but "start_time" is different in each document and I want to use a combination of both the provide a unique for each of the documents. When I feed the data to my index I see only one document and only the version of that document increases by 9 each time I do a new write (I use the same file with the same document to test how it works). I think there is something wrong in my fingerprint but I cannot understand what? Can someone please help?

Thanks.

Can you share a sample of the documents with the fields id and start_time as they appear in the JSON?

With that it is possible to try to simulate the pipeline and see what is wrong.
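For reference, a minimal pipeline like this (a sketch: stdin input and a rubydebug stdout output in place of your file and elasticsearch plugins, with the fingerprint written to a plain field so it shows up in the output) is enough to test the fingerprint logic locally:

input { stdin {} }
filter {
    json { source => "message" }
    fingerprint {
        concatenate_sources => true
        source => ["id", "start_time"]
        target => "fingerprint"
        method => "MURMUR3"
    }
}
output { stdout { codec => rubydebug } }

Paste the JSON lines into stdin and each event is printed with its computed fingerprint.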

Hi leandrojmp,

Thanks for your answer. Here is the sample of my documents:

{"owner": "me", "last_update": 1618568744821, "failed_runs": 20,  "total_runs_in_session": 115, "end_time": 1618568744000, "other_runs": 0, "passed_runs": 95, "session_status": "completed", "name": "m2e_sanity_lll_tests.eaidnrm.21_04_16_10_56_44_2660", "start_time": 1618563404000, "proj_name": "llll", "id": -885333331, "block_name": "m2e"}
{"owner": "eaidnrm", "last_update": 1619692604785, "failed_runs": 14,  "total_runs_in_session": 115, "end_time": 1619692605000, "other_runs": 0, "passed_runs": 101, "session_status": "completed", "name": "m2e_sanity_lll_tests.eaidnrm.21_04_21_10_54_28_3013", "start_time": 1618995268000, "proj_name": "lll", "id": 1610750080, "block_name": "m2e"}

The fact is that I have a lot of documents, and I just use 9 of them for testing. The id and start_time may or may not be the same across documents. Thanks.

I could not replicate your issue; using those two events, I got two different fingerprints, which would result in two different documents.

Can you share the 9 documents you are using for testing that are giving the same fingerprint?

Also, to create the fingerprint you should use fields that are present in every document. If you have documents that do not have the fields id and start_time, each one of those documents will have the same fingerprint, and they will be treated as the same document by Elasticsearch.
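For example (hypothetical events, not from your data), both of these lines lack id and start_time, so the fingerprint source would be identical for the two and the second would overwrite the first in Elasticsearch:

{"owner": "a", "name": "run_1", "session_status": "completed"}
{"owner": "b", "name": "run_2", "session_status": "completed"}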

I am sorry! I thought you just needed an example. Here are my nine documents.

{"owner": "me”, "last_update": 1618568744821, "failed_runs": 20, "total_runs_in_session": 115, "end_time": 1618568744000, "other_runs": 0, "passed_runs": 95, "session_status": "completed", "name": "m2e_sanity_lll_tests.me.21_04_16_10_56_44_2660", "start_time": 1618563404000, "proj_name": "lll", "id": -885333331, "block_name": "m2e"}
{"owner": "me" , "last_update": 1619692604785, "failed_runs": 14,  "total_runs_in_session": 115, "end_time": 1619692605000, "other_runs": 0, "passed_runs": 101, "session_status": "completed", "name": "m2e_sanity_lll_tests.me.21_04_21_10_54_28_3013", "start_time": 1618995268000, "proj_name": "lll", "id": 1610750080, "block_name": "m2e"}
{"owner": "me", "last_update": 1619693013587, "failed_runs": 26, "total_runs_in_session": 115, "end_time": 1619693013000, "other_runs": 0, "passed_runs": 89, "session_status": "completed", "name": "m2e_sanity_lll_tests.me.21_04_29_11_30_36_9180", "start_time": 1619688637000, "proj_name": "lll", "id": 778680953, "block_name": "m2e"}
{"owner": "me", "last_update": 1619769156159, "failed_runs": 0, "total_runs_in_session": 115, "end_time": 1619769156000, "other_runs": 0, "passed_runs": 115, "session_status": "completed", "name": "m2e_sanity_lll_tests.me.21_04_29_12_30_24_9666", "start_time": 1619692225000, "proj_name": "lll", "id": -36993562, "block_name": "m2e"}
{"owner": "me",  "last_update": 1621385045688, "failed_runs": 0, "total_runs_in_session": 3203, "end_time": 1621385045000, "other_runs": 0, "passed_runs": 3203, "session_status": "completed", "name": "fe_ipol2_lll.21_05_18_14_31_31_3373", "start_time": 1621341091000, "proj_name": "lll", "id": 1620965776, "block_name": "fe_ipol2"}
{"owner": "me",  "last_update": 1623140819306, "failed_runs": 1, "total_runs_in_session": 1, "end_time": 1623140819000, "other_runs": 0, "passed_runs": 0, "session_status": "completed", "name": "qqq.ppp.21_06_08_10_23_58_6841", "start_time": 1623140638000, "proj_name": "lll", "id": -1761078124, "block_name": "me_core"}
{"owner": "me",  "last_update": 1623145195588, "failed_runs": 1,  "total_runs_in_session": 1, "end_time": 1623145195000, "other_runs": 0, "passed_runs": 0, "session_status": "completed", "name": "qqq.ppp.21_06_08_11_39_03_8326", "start_time": 1623145144000, "proj_name": "lll", "id": -73975944, "block_name": "me_core"}
{"owner": "me",  "last_update": 1623150868966, "failed_runs": 1, "total_runs_in_session": 1, "end_time": 1623150868000, "other_runs": 0, "passed_runs": 0, "session_status": "completed", "name": "qqq.ppp.21_06_08_13_01_06_5601", "start_time": 1623150066000, "proj_name": "lll", "id": -1860702575, "block_name": "me_core"}
{"owner": "me", "last_date": 1623234844584, "failed_runs": 725, "total_runs_in_session": 1404, "end_time": 1623234844000, "other_runs": 0, "passed_runs": 679, "session_status": "completed", "name":"m2e_DPD_SC_Fun_Cov_tests_1h.me.21_06_09_09_26_45_2269", "start_time": 1623223605000, "proj_name": "lll", "id": -1847479680, "block_name": "m2e"}

I have fields "id" and "start_time" in all my documents. Please skip what I said in my previous post about id and start_time. The only thing I should mention is that start_time may be the same in some documents (not in these 9 documents, but in my real application).
Thanks.

Could someone please help with this problem? I really need help.
Thanks.

I also could not replicate your issue using those sample documents you shared.

This is what I got:

{
     "start_time" => 1618563404000,
             "id" => -885333331,
    "fingerprint" => 2690292881
}
{
     "start_time" => 1618995268000,
             "id" => 1610750080,
    "fingerprint" => 3895322953
}
{
     "start_time" => 1619688637000,
             "id" => 778680953,
    "fingerprint" => 1264227221
}
{
     "start_time" => 1619692225000,
             "id" => -36993562,
    "fingerprint" => 2267954862
}
{
     "start_time" => 1621341091000,
             "id" => 1620965776,
    "fingerprint" => 1535174955
}
{
     "start_time" => 1623140638000,
             "id" => -1761078124,
    "fingerprint" => 3623036441
}
{
     "start_time" => 1623145144000,
             "id" => -73975944,
    "fingerprint" => 3688766946
}
{
     "start_time" => 1623150066000,
             "id" => -1860702575,
    "fingerprint" => 1893627674
}
{
     "start_time" => 1623223605000,
             "id" => -1847479680,
    "fingerprint" => 2935393096
}

As you can see, every fingerprint is different, which would result in different documents; it works as expected for me.

Can you share the complete indexed document as well? Is the document correctly parsed?

Here is an image of my indexed data. To parse the data I used the logstash -f command in the terminal and passed my config file (the one I posted above). This is the command I use:
./bin/logstash -f ./config/myconfig.conf


May I ask if you have used the same filters as mine? Probably I am doing something wrong; I am just not sure what it is.
I use Logstash 7.10.0, and I just realized that when I try to index my documents I see this line in the logs generated by Logstash:
logstash-7.10.0/vendor/bundle/jruby/2.5.0/gems/logstash-filter-fingerprint-3.2.2/lib/logstash/filters/fingerprint.rb:195: warning: constant ::Fixnum is deprecated
I am not sure whether this message is related to my problem.
Thanks.

Oh, I'm sorry, the error was pretty clear from your first post; for some reason I didn't catch it right away.

This is the issue:

document_id => "[@metadata][fingerprint]" 

It should be

document_id => "%{[@metadata][fingerprint]}" 

Your fingerprint is correct, the reference to the field is wrong.

The correct way to reference the value of a field is to use the sprintf format, which is %{[fieldName]}.

Since you had document_id set to [@metadata][fingerprint], the literal string [@metadata][fingerprint] was being used as the id for your document; this is pretty clear in the screenshot you shared.
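With the fix applied, the output block from your config would look like this (same placeholder host and credentials as in your post):

output {
    elasticsearch {
        hosts => "myhost:port"
        user => "myname"
        password => "***"
        index => "myindex"
        document_id => "%{[@metadata][fingerprint]}"
    }
}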


And why are you even putting it in the metadata field...

filter {
    fingerprint {
        concatenate_sources => true
        source => ["id", "start_time"]
        target => "fingerprint"
        method => "MURMUR3"
    }
}
...

output {
    elasticsearch{
        hosts => "muhost:port"
        user => "myname"
        password => "***"
        index=> myindex 
        document_type => "_doc"
        document_id => "%{fingerprint}"
        action => 'update'        <!--- If you will be updating records
        doc_as_upsert => true     <!--- If you will be inserting and / or updating records
    }
}

Why would you not do that? Why index the [fingerprint] field when it is already available on the document as _id? It is redundant.


@Badger Good Point

I was not suggesting whether or not to index the fingerprint as a field... so sure, put it in a metadata field... my point was that the OP was having issues using the syntax correctly and debugging the issue, so a simple field was perhaps easier to understand than putting it in the metadata field.

I suppose it would be more efficient, but for debugging purposes it may have been easier to use a simple field that was also indexed... at least that was my thinking.

Whatever makes the most sense / is easiest...


The id is unique in each document I have, but it is an attribute of my data and not the index _id. I want to generate the document id using this attribute, and that is why I use fingerprint. Is there any other way to do that? Using fingerprints was the way I found to prevent redundant data in my Elasticsearch documents.

The fingerprint filter is the correct way to do that. The point Stephen made is that using the [@metadata][fingerprint] field can interfere with troubleshooting and debugging, since the [@metadata] field is not present in the output.

But as we can see, the id of your document was the literal value [@metadata][fingerprint]. This happened because the reference in the document_id option was wrong; it should use the sprintf format, as you can see in my last post.

@leandrojmp Thank you so much for helping me with this issue. Unfortunately I have had network problems on my side all day today and could not test your final suggestion; that is why I did not respond to it.

No problem, it should work when you test it, as using %{[@metadata][fingerprint]} will use the value of the field.

For example, if the value of the field [@metadata][fingerprint] is 712368127, this will be used as the _id value for the Elasticsearch document.

I just wanted to thank @stephenb for the helpful support. And for others who may face the same problem: @stephenb's last suggestion solved my problem. Now my fingerprint works properly.

