Looking for a little guidance here. I've got a static ES index with 2.7 billion documents in it, and I'm using Logstash to re-index it into a new, empty index with an added fingerprint field so I can deduplicate the data. (The LS strategy in the Elastic blog doesn't work for me because of the insert performance hit: at about 10% of the way through my dataset we're down to 500 documents per second, which is untenable.)
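To be clear about what I mean by a fingerprint: it's just a hash over the document's source fields, so exact duplicates end up with the same value and can be collapsed later. A rough Python illustration of the idea (the field names here are made up for the example, not my actual mapping):

```python
import hashlib

def fingerprint(doc, fields=("timestamp", "host", "message")):
    """Concatenate the chosen source fields and hash them, which is
    roughly what the Logstash fingerprint filter is doing for me."""
    concatenated = "|".join(str(doc.get(f, "")) for f in fields)
    return hashlib.sha256(concatenated.encode("utf-8")).hexdigest()

# Two documents with identical field values get identical fingerprints,
# so they can be grouped and collapsed in a later de-duplication pass.
a = {"timestamp": "2019-01-01T00:00:00Z", "host": "web-01", "message": "hello"}
b = dict(a)
print(fingerprint(a) == fingerprint(b))  # True
```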
The weird thing is that my new index, which was completely empty when I started the process, now has 3.1 billion documents in it (I suspect it would still be growing, but I've hit the disk space watermark and the index has switched to read-only).
The only thing I can think of is that the ES input plugin has pulled the same records repeatedly. How does it keep track of where it is within an index?
If it's a per-session thing, that's not going to be the problem here, as the process I followed was:
Stop Logstash
Delete destination index (was created for testing the pipeline)
Re-create destination index
Start Logstash
So the question is: why are there more records out than in?
And how do I actually know that Logstash has processed the entire source index?
I can run the de-duplication of the destination index "on the fly" to work around the disk space issue, but if Logstash is going to keep creating new duplicate records, I'm basically Sisyphus pushing the rock uphill only to have it roll back down again...
OK, based on the configuration parameters available in the ES input plugin, it looks to me like it's using a scroll to extract the records.
As such, I knocked up a quick Python script to scroll the entire contents of my index 10k records at a time and sum the total. The number of records my script counted matched the number of documents in the index perfectly (15-odd hours later).
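For reference, the script is essentially just the standard scroll loop; here's a rough reconstruction rather than the exact script I ran (host and index names are placeholders, page size 10k as described):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder host
INDEX = "source-index"                       # placeholder index name

# Open a scroll over the whole index: 10k hits per page, 10 minute context.
resp = es.search(index=INDEX, scroll="10m", size=10_000)  # match_all by default
scroll_id = resp["_scroll_id"]
total = len(resp["hits"]["hits"])

# Keep asking for the next page until a scroll request comes back empty.
while True:
    resp = es.scroll(scroll_id=scroll_id, scroll="10m")
    hits = resp["hits"]["hits"]
    if not hits:
        break
    scroll_id = resp["_scroll_id"]
    total += len(hits)

es.clear_scroll(scroll_id=scroll_id)
print(f"scrolled {total} documents")
```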
I thought LS was maybe losing its search context somewhere along the line, so I tweaked the parameters to increase the search context lifetime to 10m and pull 10k records per request.
Then I purged the index and started Logstash again, monitoring the ES logs throughout the process (since a missing search context would show up there). There were no errors reported in the ES log (or the LS log, for that matter), but this morning I find that my target index once again has more documents than my source index (though this time it was only 40 million over when I stopped LS).
Is it just that once LS gets to the end of the search it starts the same search again? If that's the case, then it'll just be an additional 40 million duplicates to clean up, which in the grand scheme of things is not a major problem; but if not, then I still need to figure out what's going on.
I'd really appreciate any insight anyone can offer here. I've been fighting this for three weeks now and I'd like to finally put it to bed.