I'm using document_id and action => create in the Logstash elasticsearch output plugin to avoid duplicates. But since the de-duplication happens in the output plugin, the duplicates still pass through all of my filter plugin operations first.
Is there any way to identify duplicates before the filter plugins and drop them there?
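For reference, a minimal sketch of the kind of output configuration described above (hosts, index name, and the id field are hypothetical placeholders):

```
output {
  elasticsearch {
    hosts       => ["localhost:9200"]        # hypothetical cluster address
    index       => "myindex"                 # hypothetical index name
    document_id => "%{[some_unique_field]}"  # placeholder: whatever uniquely identifies the event
    action      => "create"                  # fail instead of overwriting when the _id already exists
  }
}
```

With action => "create", Elasticsearch rejects a second document with the same _id rather than updating it, so duplicates never get indexed, but they still run through every filter first.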
There are cases where it can be done upstream. For example, a jdbc input might be configured with SELECT DISTINCT, which would eliminate duplicates at the source.
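For instance, a sketch of a jdbc input along those lines (connection string, driver, table, and schedule are all hypothetical):

```
input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"    # hypothetical database
    jdbc_user              => "logstash"
    jdbc_driver_class      => "org.postgresql.Driver"
    statement              => "SELECT DISTINCT id, message FROM events"  # de-duplicate at the source
    schedule               => "*/5 * * * *"                              # poll every five minutes
  }
}
```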
In the filter stage, if you are writing to elasticsearch, you might be able to add an elasticsearch filter that queries for the existence of a document before processing the event. But that query is not cheap and may not be a net optimization; you would need to benchmark with and without it.
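As a rough sketch of that idea (not tested; the index name and lookup field are assumptions, and the exact behavior of the fields and tag_on_failure options depends on your plugin version):

```
filter {
  # Look up the event's id in the target index; if a hit is found,
  # copy one of its fields onto the event so we can detect it below.
  elasticsearch {
    hosts  => ["localhost:9200"]             # hypothetical cluster address
    index  => "myindex"                      # hypothetical target index
    query  => "_id:%{[some_unique_field]}"   # placeholder id field
    fields => { "message" => "[@metadata][existing]" }
    tag_on_failure => []                     # a miss just means the document is new
  }

  # If the lookup populated the metadata field, the document already
  # exists, so drop the event before the expensive filters run.
  if [@metadata][existing] {
    drop { }
  }
}
```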
I'm using a file input here, so I believe the ES filter query would be the option for me. I have a lot of ruby operations within my filter, so I expect an ES filter query would be more efficient than passing the duplicates through all of those ruby operations.
Any leads/suggestions on how to do this query in the ES filter?
Also, currently I'm using just the document_id in the ES output plugin to avoid indexing duplicates into ES. But I see the fingerprint filter doing something similar. Could you enlighten me on what the difference is between the two? Is there any way I can use fingerprint to stop duplicates from entering the filter plugins, instead of using document_id?
I do not run elasticsearch so I cannot help with configuring that part. I suggest you ask a new question about how to use an elasticsearch filter to test whether a document id already exists in an index.
No. You avoid duplicates by making sure a duplicate document has the same document_id. The fingerprint filter can be used to generate that id.
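A common pattern (hashing the message body; the hash method and field names here are just examples) looks like this:

```
filter {
  # Compute a stable hash of the event content; identical events
  # always produce the same fingerprint.
  fingerprint {
    source => "message"
    target => "[@metadata][fingerprint]"
    method => "SHA256"
  }
}

output {
  elasticsearch {
    hosts       => ["localhost:9200"]            # hypothetical cluster address
    index       => "myindex"                     # hypothetical index name
    document_id => "%{[@metadata][fingerprint]}" # duplicates collapse onto one _id
    action      => "create"
  }
}
```

Because a duplicate event hashes to the same fingerprint, it gets the same document_id, and action => "create" makes Elasticsearch reject the second copy.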