Hi All,
I'm looking for a little guidance here. I've got a static ES index with 2.7 billion documents in it, and I'm using Logstash to re-index it into a new, empty index with an added fingerprint field so I can deduplicate the data. (The Logstash strategy in the Elastic blog doesn't work for me because of the insert performance hit: about 10% of the way through my dataset we're down to 500 documents per second, which is untenable.)
The weird thing is that my new index, which was completely empty when I started the process, now has 3.1 billion documents in it. (I suspect it would still be growing, but I've hit the disk space watermark and the index has switched to read-only.)
This is my Logstash pipeline:
input {
  elasticsearch {
    hosts => "localhost"
    index => "source-index"
    query => '{ "sort": [ "_doc"] }'
  }
}
filter {
  fingerprint {
    key => "sjCq5E7M"
    method => "SHA256"
    source => ["field1", "field2"]
    concatenate_sources => "true"
    target => "hash"
  }
  mutate {
    remove_field => ["@version", "@timestamp"]
  }
}
output {
  stdout { codec => dots }
  elasticsearch {
    index => "dest-index"
  }
}
The only thing I can think of is that the ES input plugin has pulled the same records repeatedly. How does it keep track of where it is within an index?
If it's a per-session thing, that's not going to be the problem here, as the process I followed was:
- Stop Logstash
- Delete destination index (was created for testing the pipeline)
- Re-create destination index
- Start Logstash
So the question is why are there more records out than in?
And how do I actually know that Logstash has processed the entire source index?
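For the totals, a plain count comparison is about all I have to go on. Below is a minimal Python sketch of that check. The URL `http://localhost:9200` (default port, no authentication) and the helper names `parse_count`, `count_docs` and `surplus` are my own assumptions for illustration; the only thing it relies on is the `_count` API returning a JSON body with a `count` field. It says nothing about whether every source document was actually read, only how far apart the two totals are.

```python
import json
import urllib.request

# Assumption: Elasticsearch is reachable here without authentication.
ES_URL = "http://localhost:9200"


def parse_count(body):
    """Extract the document total from a _count API response body."""
    return json.loads(body)["count"]


def count_docs(index, es_url=ES_URL):
    """Ask the _count API how many documents one index currently holds."""
    with urllib.request.urlopen(f"{es_url}/{index}/_count") as resp:
        return parse_count(resp.read())


def surplus(source_count, dest_count):
    """Documents in the destination beyond the source total.

    A negative value means the destination is still short of the source.
    """
    return dest_count - source_count
```

Calling `surplus(count_docs("source-index"), count_docs("dest-index"))` against my numbers would come out at roughly 400 million extra documents.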
I can run the de-duplication of the destination index "on the fly" to work around the disk space, but if Logstash is going to keep creating new duplicate records, I'm basically Sisyphus pushing the rock uphill only to have it roll down again...
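To see which fingerprints are actually duplicated, something like the following terms aggregation might do as a starting point. It is only a sketch: it assumes the `hash` field is mapped as `keyword` (with default dynamic mapping it would probably need to be `hash.keyword`), the aggregation name `dupes` and both helper names are made up for illustration, and on an index this size the bucket counts are approximate and the query may be expensive.

```python
def duplicate_hash_query(field="hash", size=1000):
    """Build a search body that returns fingerprint values seen at least twice.

    Assumption: `field` is a keyword field holding the fingerprint.
    """
    return {
        "size": 0,  # no hits needed, only the aggregation buckets
        "aggs": {
            "dupes": {
                "terms": {"field": field, "min_doc_count": 2, "size": size}
            }
        },
    }


def extra_copies(response):
    """Map each duplicated fingerprint to its number of surplus copies."""
    buckets = response["aggregations"]["dupes"]["buckets"]
    return {b["key"]: b["doc_count"] - 1 for b in buckets}
```

The body from `duplicate_hash_query()` would be sent to `dest-index/_search`, and `extra_copies` applied to the parsed response gives the number of documents per fingerprint that would have to be deleted.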
Assistance would be greatly appreciated.
Thanks,
-J
