I'm using document_id and action => create in the Logstash elasticsearch output plugin to avoid duplicates. But since the de-duplication happens in the output plugin, the duplicates still pass through all of my filter plugin operations first.
Is there any way to identify duplicates before the filter plugins and drop them there?
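For reference, a minimal sketch of the kind of output configuration described above (hosts, index name, and the id field are hypothetical placeholders):

```
output {
  elasticsearch {
    hosts       => ["localhost:9200"]        # hypothetical cluster address
    index       => "myindex"                 # hypothetical index name
    document_id => "%{[some_unique_field]}"  # placeholder: whatever uniquely identifies the event
    action      => "create"                  # fail instead of overwriting when the _id already exists
  }
}
```

With action => "create", Elasticsearch rejects a second document with the same _id rather than updating it, so duplicates never get indexed, but they still run through every filter first.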
There are cases where it can be done upstream. For example, a jdbc input might be configured with SELECT DISTINCT, which would eliminate duplicates at the source.
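For instance, a sketch of a jdbc input along those lines (connection string, driver, table, and schedule are all hypothetical):

```
input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"    # hypothetical database
    jdbc_user              => "logstash"
    jdbc_driver_class      => "org.postgresql.Driver"
    statement              => "SELECT DISTINCT id, message FROM events"  # de-duplicate at the source
    schedule               => "*/5 * * * *"                              # poll every five minutes
  }
}
```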
In the filter stage, if you are writing to elasticsearch, you might be able to add an elasticsearch filter that queries for the existence of a document before processing the event. But that query is not cheap and may not be a net optimization; you would need to benchmark with and without it.
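As a rough sketch of that idea (not tested; the index name and lookup field are assumptions, and the exact behavior of the fields and tag_on_failure options depends on your plugin version):

```
filter {
  # Look up the event's id in the target index; if a hit is found,
  # copy one of its fields onto the event so we can detect it below.
  elasticsearch {
    hosts  => ["localhost:9200"]             # hypothetical cluster address
    index  => "myindex"                      # hypothetical target index
    query  => "_id:%{[some_unique_field]}"   # placeholder id field
    fields => { "message" => "[@metadata][existing]" }
    tag_on_failure => []                     # a miss just means the document is new
  }

  # If the lookup populated the metadata field, the document already
  # exists, so drop the event before the expensive filters run.
  if [@metadata][existing] {
    drop { }
  }
}
```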
I'm using a file input here, so I believe the ES filter query would be the option for me. I have a lot of ruby operations within my filter, so I expect an ES filter query would be more efficient than passing the duplicates through all of those ruby operations.
Any leads/suggestions on how to do this query in the ES filter?
Also, currently I'm using just the document_id in the ES output plugin to avoid indexing duplicates into ES. But I see the fingerprint filter doing something similar. Could you enlighten me on what the difference is between the two? Is there any way I can use fingerprint to stop duplicates from entering the filter plugins, instead of using document_id?
I do not run elasticsearch so I cannot help with configuring that part. I suggest you ask a new question about how to use an elasticsearch filter to test whether a document id already exists in an index.
No. You avoid duplicates by making sure a duplicate document has the same document_id. The fingerprint filter can be used to generate that id.
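A common pattern (hashing the message body; the hash method and field names here are just examples) looks like this:

```
filter {
  # Compute a stable hash of the event content; identical events
  # always produce the same fingerprint.
  fingerprint {
    source => "message"
    target => "[@metadata][fingerprint]"
    method => "SHA256"
  }
}

output {
  elasticsearch {
    hosts       => ["localhost:9200"]            # hypothetical cluster address
    index       => "myindex"                     # hypothetical index name
    document_id => "%{[@metadata][fingerprint]}" # duplicates collapse onto one _id
    action      => "create"
  }
}
```

Because a duplicate event hashes to the same fingerprint, it gets the same document_id, and action => "create" makes Elasticsearch reject the second copy.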