How to avoid duplicates before entering the filter plugin?

Hi,

I'm using document_id and action => create in the Logstash elasticsearch output plugin to avoid duplicates. Since this happens in the output plugin, the duplicates still pass through all of my filter plugin operations.
Is there any way to identify a duplicate before the filter stage and drop it there?

Any help is appreciated. Thanks in advance

That is not possible if you think about it logically: each event passes through the filters on its own, and a filter has no knowledge of what has already passed through or what will come next.

You can check what is already stored in your Elasticsearch index, but that means reading back a record for every incoming event and comparing it with the one coming in.

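If you use document_id (with the default index action), then:
if the document already exists in Elasticsearch, it is replaced by the new one;
if the document does not exist, Elasticsearch creates it.
With action => create, as in your config, an existing document is left untouched and the duplicate write is rejected instead.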

There are cases where it can be done upstream. For example, a jdbc input might be configured with 'SELECT DISTINCT' which would eliminate duplicates.
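
As a rough illustration of the upstream approach, a jdbc input could look something like the sketch below. The connection string, user, driver path, table, and column names are all placeholders invented for the example, not taken from anyone's actual setup.

```
input {
  jdbc {
    # Hypothetical connection details; replace with your own.
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"
    jdbc_user => "logstash"
    jdbc_driver_library => "/path/to/postgresql.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    # DISTINCT removes identical rows in the database,
    # so they never reach the filter stage at all.
    statement => "SELECT DISTINCT id, message FROM events"
  }
}
```

This only removes rows that are identical within a single query result. It does not help with a file input.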

In the filters, if you are writing to elasticsearch you might be able to add an elasticsearch filter to query the existence of a document before processing it. But that is not cheap and may not be an optimization. You would need to benchmark both with and without.
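
An untested sketch of that idea follows. It assumes the document ID is a SHA256 fingerprint of the message field, and that the host and index name (localhost:9200, my-index) are placeholders; the [@metadata][already_indexed] field is just a marker name made up for this example.

```
filter {
  # Compute the same ID that the output would use as document_id.
  fingerprint {
    source => ["message"]
    target => "[@metadata][fingerprint]"
    method => "SHA256"
  }

  # Look up a document with that ID. If one is found, copy its
  # @timestamp into a marker field on the current event.
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "my-index"
    query => "_id:%{[@metadata][fingerprint]}"
    fields => { "@timestamp" => "[@metadata][already_indexed]" }
  }

  # The marker is only set when the lookup matched, so drop those events.
  if [@metadata][already_indexed] {
    drop { }
  }

  # ... expensive filters go here ...
}
```

Keep in mind that this adds one search request per event, and that recently indexed documents only become searchable after an index refresh, so duplicates arriving close together can still slip through. Keeping document_id in the output as a safety net would still make sense.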

I'm using a file input here, so I believe the ES filter query would be the option for me. I have a lot of ruby operations within my filter section, so I guess an ES filter query would be more efficient than passing the duplicates through all of those ruby operations.
Any leads or suggestions on how to write this query in the ES filter?

Also, currently I'm using just document_id in the ES output plugin to keep duplicates from being indexed into ES, but I see the fingerprint filter doing something similar. Could you enlighten me on what the difference between the two is? Is there any way I can use fingerprint, instead of document_id, to stop duplicates from entering the filter plugins?

Thanks in advance

I do not run elasticsearch so I cannot help with configuring that part. I suggest you ask a new question about how to use an elasticsearch filter to test whether a document id already exists in an index.

No. You avoid duplicates by making sure a duplicate document has the same document_id. The fingerprint filter can be used to generate that id.
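
A minimal sketch of how the two fit together, assuming the message field is what identifies a duplicate and that you are on a reasonably recent version of the fingerprint filter (older versions needed a key for some methods). The host and index name are placeholders.

```
filter {
  # Identical messages produce an identical hash.
  fingerprint {
    source => ["message"]
    target => "[@metadata][fingerprint]"
    method => "SHA256"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "my-index"
    # The fingerprint becomes the document ID, so a duplicate event
    # maps onto the same document instead of creating a new one.
    document_id => "%{[@metadata][fingerprint]}"
    action => "create"
  }
}
```

Storing the hash under [@metadata] keeps it out of the indexed document. The fingerprint filter only generates the ID; it does not drop anything by itself.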

Thanks @Badger for the reply. Will do that.
