I'm in the process of ingesting a large amount of data into an ES index (several billion records). There's a lot of duplication in my input dataset, so the first block was imported into an index and is currently being churned into a different index using the logstash + fingerprint method for deduplication, i.e. I generate a hash of the two fields I'm interested in and use that as the document ID in the new index (code lifted more or less verbatim from https://www.elastic.co/blog/how-to-find-and-remove-duplicate-documents-in-elasticsearch).
Because of the size of the source dataset, I want to do the deduplication + ingestion in a single step (the first block took about a week to import, and so far the deduplication operation has been running for five days). To that end I've built a logstash pipeline which uses a grok filter to extract the fields directly from the source files (I was previously using a Python script), then uses the same fingerprint filter to generate an ID for import.
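For reference, roughly what the fingerprint-as-ID scheme is doing, sketched in Python (field names and the separator are placeholders, not my actual config; I'm assuming the two source fields are concatenated before hashing, per the blog post):

```python
import hashlib

def doc_id(fielda: str, fieldb: str) -> str:
    # Mirror of the fingerprint filter: hash the concatenation of the two
    # fields of interest and use the digest as the ES document _id.
    # The "|" separator is a placeholder to keep field boundaries distinct.
    return hashlib.sha1(f"{fielda}|{fieldb}".encode("utf-8")).hexdigest()

# Records sharing fielda but differing in fieldb should get distinct IDs,
# so all three should survive deduplication as separate documents.
ids = {doc_id("somevalue", b) for b in ("onevalue", "anothervalue", "athirdvalue")}
print(len(ids))  # 3 distinct IDs
```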
This all SEEMED to be working OK, but I've run into a very strange issue.
Some of the lines in the source data are duplicated verbatim; that's no problem. But some of them share a value for one field while differing in another (these are the two fields in my fingerprint filter), e.g.:
```
fielda: somevalue fieldb: onevalue
fielda: somevalue fieldb: anothervalue
fielda: somevalue fieldb: athirdvalue
```
If I feed in a file containing JUST data like the above (three actual records from one of my input files with the same fielda value and different fieldb values), everything works as expected: I end up with three records in ES representing the three different pairs of values.
The weird part is that when I import a larger set which includes those same three entries, I get only ONE record in ES for that particular fielda value.
I've validated the logstash pipeline by adding a parallel "file" output which receives the same events as the "elasticsearch" output. The file contains the three separate records as expected (and their generated IDs are all unique), but ES contains only one.
A "real" example:
- Source file contains 921 lines
- 616 of those lines are unique (per cat [file] | sort | uniq | wc -l)
- Within those 616 lines there are many which have the same value in fielda but whose fielda+fieldb combination is unique.
- I expect 616 records in ES; instead I get 147 documents, with 469 deleted.
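A quick way to check where the 147 might come from is to count distinct fielda values versus distinct fielda+fieldb pairs in the source file: if the ES document count matches the distinct fielda count, that would support the idea that only fielda is effectively acting as the key. A sketch, assuming one record per line (the regex below is a stand-in for my grok pattern, not the real one):

```python
import re

# Placeholder pattern; the real extraction is done by the grok filter.
PATTERN = re.compile(r"fielda: (?P<fielda>\S+) fieldb: (?P<fieldb>\S+)")

def count_keys(lines):
    # Count distinct fielda values vs distinct (fielda, fieldb) pairs.
    fielda_values, pairs = set(), set()
    for line in lines:
        m = PATTERN.search(line)
        if not m:
            continue
        fielda_values.add(m.group("fielda"))
        pairs.add((m.group("fielda"), m.group("fieldb")))
    return len(fielda_values), len(pairs)

sample = [
    "fielda: somevalue fieldb: onevalue",
    "fielda: somevalue fieldb: anothervalue",
    "fielda: somevalue fieldb: athirdvalue",
]
print(count_keys(sample))  # (1, 3): one fielda value, three unique pairs
```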
So it appears that on the large batch ES is NOT honoring the provided ID (even though the correct value appears in _id) and is instead keying solely off the value of fielda.
Potentially relevant side note: if I import the three-line test document (resulting in the 3 expected entries in ES) and THEN import the 921-line document, I'm left with only 1 document with the fielda value in question, which tends to support the above suspicion.
I've a nasty feeling this may be a bug, and if so it's probably also breaking my current deduplication/reindexing operation...
If anybody can shed some light I'd be very grateful.