Looking for a little guidance here. I've got a static ES index with 2.7 billion documents in it, and I'm using Logstash to re-index it into a new, empty index with an added fingerprint field so I can deduplicate the data. (The LS strategy in the Elastic blog doesn't work for me because of the insert performance hit: at about 10% of the way through my dataset we're down to 500 documents per second, which is untenable.)
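To be clear about what I mean by a fingerprint: it's just a hash over the document's source fields, so exact duplicates end up with the same value and can be collapsed later. A rough Python illustration of the idea (the field names here are made up for the example, not my actual mapping):

```python
import hashlib

def fingerprint(doc, fields=("timestamp", "host", "message")):
    """Concatenate the chosen source fields and hash them, which is
    roughly what the Logstash fingerprint filter is doing for me."""
    concatenated = "|".join(str(doc.get(f, "")) for f in fields)
    return hashlib.sha256(concatenated.encode("utf-8")).hexdigest()

# Two documents with identical field values get identical fingerprints,
# so they can be grouped and collapsed in a later de-duplication pass.
a = {"timestamp": "2019-01-01T00:00:00Z", "host": "web-01", "message": "hello"}
b = dict(a)
print(fingerprint(a) == fingerprint(b))  # True
```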
The weird thing is that my new index, which was completely empty when I started the process, now has 3.1 billion documents in it (I suspect it would still be growing, but I've hit the disk space watermark and the index has switched to read-only).
The only thing I can think of is that the ES input plugin has pulled the same records repeatedly. How does it keep track of where it is within an index?
If it's a per-session thing, that's not going to be the problem here, as the process I followed was:
Stop Logstash
Delete destination index (was created for testing the pipeline)
Re-create destination index
Start Logstash
So the question is: why are there more records out than in?
And how do I actually know that Logstash has processed the entire source index?
I can run the de-duplication of the destination index "on the fly" to work around the disk space issue, but if Logstash is going to keep creating new duplicate records, I'm basically Sisyphus pushing the rock uphill only to have it roll back down again...
OK, based on the configuration parameters available in the ES input plugin, it looks to me like it's using a scroll to extract the records.
As such, I knocked up a quick Python script to scroll the entire contents of my index 10k records at a time and sum the total. The number of records my script counted matched the number of documents in the index perfectly (15-odd hours later).
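For reference, the script is essentially just the standard scroll loop; here's a rough reconstruction rather than the exact script I ran (host and index names are placeholders, page size 10k as described):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder host
INDEX = "source-index"                       # placeholder index name

# Open a scroll over the whole index: 10k hits per page, 10 minute context.
resp = es.search(index=INDEX, scroll="10m", size=10_000)  # match_all by default
scroll_id = resp["_scroll_id"]
total = len(resp["hits"]["hits"])

# Keep asking for the next page until a scroll request comes back empty.
while True:
    resp = es.scroll(scroll_id=scroll_id, scroll="10m")
    hits = resp["hits"]["hits"]
    if not hits:
        break
    scroll_id = resp["_scroll_id"]
    total += len(hits)

es.clear_scroll(scroll_id=scroll_id)
print(f"scrolled {total} documents")
```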
I thought LS was maybe losing its search context somewhere along the line, so I tweaked the parameters to increase the search context lifetime to 10m and pull 10k records per request.
Then I purged the index and started Logstash again, monitoring the ES logs throughout the process (since a missing search context would show up there). There were no errors reported in the ES log (or the LS log, for that matter), but this morning I find that my target index once again has more documents than my source index (though this time it was only 40 million over when I stopped LS).
Is it just that once LS gets to the end of the search it starts the same search again? If that's the case, then it'll just be an additional 40 million duplicates to clean up, which in the grand scheme of things is not a major problem; but if not, then I still need to figure out what's going on.
I'd really appreciate any insight anyone can offer here. I've been fighting this for three weeks now and I'd like to finally put it to bed.