Strange issue with bulk(?) interface

ElasticSearch 6.3.4
Logstash 6.6.0

I'm in the process of ingesting a large amount of data into an ES index (several billion records). There's a lot of duplication in my input dataset, so the first block was imported into one index and is presently being churned into a different index using the Logstash + fingerprint method for deduplication, i.e. I generate a hash of the two fields I'm interested in and use that as the ID for the document in the new index (code lifted more-or-less verbatim from https://www.elastic.co/blog/how-to-find-and-remove-duplicate-documents-in-elasticsearch).
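
Roughly, that reindexing pipeline looks like this (hosts, key, and index names are placeholders, and this is before the concatenate_sources fix discussed further down):

input {
    # read the already-imported documents back out of the original index
    elasticsearch {
        hosts => ["XXXX:9200"]
        index => "sourceindex"
    }
}

filter {
    # hash the two fields of interest; the hash becomes the new document ID
    fingerprint {
        key => "XXXXXXX"
        method => "SHA256"
        source => ["fielda", "fieldb"]
        target => "[@metadata][generated_id]"
    }
}

output {
    # duplicate fielda+fieldb pairs collapse onto the same _id in the new index
    elasticsearch {
        hosts => ["XXXX:9200"]
        index => "dedupedindex"
        document_id => "%{[@metadata][generated_id]}"
    }
}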

Because of the size of the source dataset I'm looking to do the deduplication and ingestion in a single step (it took a week or so for the first block to import, and so far the deduplication operation has been running for 5 days). To that end I've built a Logstash pipeline which uses a grok filter to extract the fields directly from the source files (I was previously using a Python script) and then uses the same fingerprint filter to generate a new ID for import.

This all SEEMED to be working OK, but I've run into a very strange issue.

Some background:
Some of the lines in the source data are duplicated verbatim; that's no problem. But some of them have a common value for one field and a different value for another field (these are the two fields in my fingerprint filter), i.e.:
fielda: somevalue fieldb: onevalue
fielda: somevalue fieldb: anothervalue
fielda: somevalue fieldb: athirdvalue

If I feed in a file containing JUST data like that above (three actual records from one of my input files which have the same fielda value and different fieldb values), everything works as expected: I end up with three records in ES representing the three different pairs of values.

The weird part is that when I import a larger set which includes those same three entries, I only get ONE record in ES for that particular fielda value.

I've validated the Logstash pipeline by creating a parallel "file" output which receives the same values as the "elasticsearch" output. The file contains the three separate records as expected (and their generated IDs are all unique), but ES only contains one.

A "real" example;

  • Source file contains 921 lines
  • 616 of those lines are unique (per cat [file] | sort | uniq | wc -l)
  • Within those 616 lines there are many which have the same value in fielda but whose fielda+fieldb is unique.
  • I expect 616 documents in ES; instead I get 147, with 469 showing as deleted.

So it appears that on the large batch ES is NOT handling the provided ID properly (in spite of the correct value appearing in _id) and is instead keying solely off the value of fielda.

Potentially relevant sidenote: if I import the three-line test document (resulting in the 3 expected entries in ES) and THEN import the 921-line document, I'm left with only 1 document for the fielda value in question, which tends to support the above suspicion.

I've a nasty feeling this may be a bug, and as such it's probably also breaking my current deduplication/reindexing operation...

If anybody can shed some light I'd be very grateful.

Thanks

-J

Please show examples of your data as well as your config.

What config are you looking for? As I say, I've validated the Logstash pipeline already. Do you want index config, ES config, or something else? (Sorry, being new around here I'm not sure what's required for troubleshooting.)

If you're after the LS pipeline (though as I say it's confirmed working with the file output), here are the relevant parts:

input {
    file {
<snip>
    }
}

filter {
<snip>
    # match against custom pattern
    grok {
        patterns_dir => ["/etc/logstash/patterns"]
        patterns_files_glob => "*"
        match => { "message" => "%{CUSTOMPATTERN}" }
    }
<snip>
    }
    # generate fingerprint for id
    fingerprint {
        key => "XXXXXXX"
        method => "SHA256"
        source => ["fielda", "fieldb"]
        target => "[@metadata][generated_id]"
    }
<snip>
}

output {
    if [fielda] and [fieldb] {
        elasticsearch {
            index => "indexa"
            hosts => ["XXXX:9200"]
            document_id => "%{[@metadata][generated_id]}"
            # deprecated but LS complains if we don't have it
            document_type => "customtype"
        }
        file {
            path => "/opt/ingestion/test.txt"
            dir_mode => 0775
            file_mode => 0664
            codec => line { format => "fielda: %{fielda} fieldb: %{fieldb} id: %{[@metadata][generated_id]}" }
            write_behavior => "overwrite"
        }
    }
<snip>
}

I can't share the data as it contains sensitive information, but it should be possible to cook up a "clean" dataset which will replicate the issue; it might take a day or so to get to that though.

It would help to see actual records as well as the actual pattern you are using to extract the data. Is it possible that the grok pattern is not working as you expect and that fielda and fieldb are not getting populated for all records? Do you see the expected number of records in the file output that is within the conditional?

The grok filter is working fine.

I see all the records I'm expecting to see in /opt/ingestion/test.txt (921 lines corresponding to the 921 lines in the source, 616 of them unique corresponding to the 616 unique lines in the source); this is how I validated my LS pipeline and the grok pattern.

As noted though, on the same run against an empty ES index I see 147 docs and 469 deleted docs. 147 + 469 = 616, so they're all getting to ES, but for whatever reason when it encounters a record with the same fielda value as an existing document, whether it has a different fieldb value or not, it deletes the existing document and adds the new record to the index (the one remaining record for that particular fielda value is the last one which appears in the input file).

Feeding the same pipeline a file containing JUST three records, all with the same fielda value and different fieldb values (extracted with "grep [fieldavalue] [originalfile] > [testfile]", so the data is exactly as it was in the original file), results in the three records I'm expecting in ES. So it seems that LS is using a different mechanism for smaller batches which doesn't exhibit this issue (hence the "bulk(?)" in the subject).

If you concatenate fielda and fieldb into a separate metadata field and base the fingerprint on this, does this change anything?

Interesting. I added:

     mutate {
         add_field => [ "[@metadata][combinedfield]", "%{fielda}%{fieldb}" ]
     }

That goes above the fingerprint filter, and I changed the filter to fingerprint on [@metadata][combinedfield]. Now I'm seeing 613 records in the index. (The three "missing" ones probably have some extra spaces around the separator or similar, as this dataset is quite inconsistent, so they show as unique to uniq but not when you take just the field values.)
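
For completeness, the changed filter section now looks roughly like this (same key, method, and target as before, other parts snipped):

    mutate {
        add_field => [ "[@metadata][combinedfield]", "%{fielda}%{fieldb}" ]
    }
    # fingerprint the single combined field rather than the two separate fields
    fingerprint {
        key => "XXXXXXX"
        method => "SHA256"
        source => ["[@metadata][combinedfield]"]
        target => "[@metadata][generated_id]"
    }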

Ah, I think I've found the issue. It turns out it IS in LS rather than ES after all: if I remove fielda from the original filter (fieldb is duplicated across some records too) I get the same results (and the generated IDs look to be the same), so it appears that the fingerprint was only being applied to the second field in the list rather than to all the fields in the list... which would explain the inconsistencies I was seeing.

The fact I wasn't seeing it with the smaller test dataset was down to not having duplicate fieldb values in there.

Unfortunately that means I need to kill the existing re-indexing operation (6 days in :/) and re-start from scratch.

Confirmed;

source => [ "fielda", "fieldb" ]

Yields exactly the same (identical fingerprint/id value) result as

source => [ "fieldb" ]

I think the problem might have been that you did not set ‘concatenate_sources’ in the filter.

Ah, I assume that should be set in the fingerprint filter? And if it's not set, does it do the fingerprint based on the first field first, then based on the second?

Seems like somewhat broken behaviour (i.e. why would I supply multiple sources if I didn't want the operation to occur on all of them)...

UPDATE:

I've just tested the original pipeline with;

concatenate_sources => "true"

added to the filter stanza and I'm seeing the result I expect now.
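
For anyone else hitting this, the working fingerprint filter now looks like this (key redacted as before):

    fingerprint {
        key => "XXXXXXX"
        method => "SHA256"
        source => ["fielda", "fieldb"]
        target => "[@metadata][generated_id]"
        # hash fielda and fieldb together instead of fingerprinting each
        # field in turn and overwriting the target
        concatenate_sources => "true"
    }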

Since it's required for this to work as expected, I'd suggest getting in touch with the owner of the blog post (https://www.elastic.co/blog/how-to-find-and-remove-duplicate-documents-in-elasticsearch) and getting them to update the code in there, to save anybody else having the same issue in future.


Ah, reading the docs I now understand the intention.

Theoretically it was returning an array of fingerprints, one calculated for each of the source fields; I'm not sure why the one that ended up in my generated_id metadata field was the last in the array. I would've expected, if anything, to get the first.

shrug

Still probably best to update that blog post though, since it's not going to do what it's expected to do :wink:

Apparently this is a known issue first reported September 24 2015...

And it's not behaving as the documentation says it should, apparently; I would've been seeing something different if it were actually returning an array. It turns out it's just fingerprinting the last field.
