Logstash -> drop duplicate -> elasticsearch

Hello,

Looking over documentation I see that the field "document_id" can be used as a way to upsert in Elasticsearch.
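For reference, this is the kind of upsert configuration I mean (a minimal sketch; the host, index name, and the `my_unique_id` field are placeholders for whatever holds your own unique key):

```
output {
  elasticsearch {
    hosts         => ["localhost:9200"]
    index         => "myindex"
    # Use our own key as the Elasticsearch document ID
    document_id   => "%{[my_unique_id]}"
    # Update the existing document if the ID is already there,
    # otherwise insert it
    action        => "update"
    doc_as_upsert => true
  }
}
```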

However, is there a way to detect the duplicate at the filter stage and just drop it?

Thanks,
E

Theoretically you could probably use the elasticsearch filter but that would be insanely inefficient.
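Roughly like this, if you really wanted to (hosts, index, and field names are placeholders); note that it issues one search request per event, which is where the inefficiency comes from:

```
filter {
  # Look for an existing document with the same ID.
  elasticsearch {
    hosts  => ["localhost:9200"]
    index  => "myindex"
    query  => "_id:%{[my_unique_id]}"
    # Copy a field from the hit, if any, so we can tell there was a match
    # (hypothetical target field name)
    fields => { "@timestamp" => "existing_timestamp" }
  }
  # If the lookup populated the field, a document with this ID already exists.
  if [existing_timestamp] {
    drop { }
  }
}
```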

@magnusbaeck yeah, that would not be good.

How about something like the following:

  1. A local file that tracks all the document IDs, populated at the output stage
  2. At the filter stage you simply check against that file, and if you find a matching document ID you drop the event

The side effect would be: if the elasticsearch output fails but the file output succeeds, the document never makes it into Elasticsearch even though its ID is on file, so you'd believe it succeeded. To resolve this side effect you would need a third step.

  1. If the elasticsearch output fails, catch the failure and remove the document ID from the local file

This would be a pretty efficient solution. Based on your experience, can you recommend the steps necessary to make this possible?

Thanks,
E

  1. A local file that tracks all the document IDs, populated at the output stage
  2. At the filter stage you simply check against that file, and if you find a matching document ID you drop the event

Yes, that approach would work. Write the IDs to a file with a file output and use a translate filter to look up from that file.
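A rough sketch of both halves (untested; the field name, file path, and hosts are placeholders, and translate option names vary between plugin versions):

```
filter {
  # Look the event's ID up in the dictionary file written by the output below.
  translate {
    field            => "[my_unique_id]"
    destination      => "[is_duplicate]"
    dictionary_path  => "/var/lib/logstash/seen-ids.yml"
    fallback         => "no"
    # The dictionary is only reloaded periodically, so very recent IDs
    # may not be visible yet.
    refresh_interval => 60
  }
  if [is_duplicate] != "no" {
    drop { }
  }
}

output {
  # Append each ID as a YAML "key: value" line so translate can read it back.
  file {
    path  => "/var/lib/logstash/seen-ids.yml"
    codec => line { format => "%{[my_unique_id]}: seen" }
  }
  elasticsearch {
    hosts       => ["localhost:9200"]
    document_id => "%{[my_unique_id]}"
  }
}
```

Note the refresh window: events arriving between dictionary reloads won't see each other's IDs, so this only catches duplicates that are at least one refresh interval apart.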

  1. If the elasticsearch output fails, catch the failure and remove the document ID from the local file

Sorry, that's not possible unless you modify the elasticsearch output.

@magnusbaeck thanks for your input.

Regarding the 3rd step, I filed a feature request https://github.com/elastic/logstash/issues/6814

Thanks,
E

The problem with trying to detect duplicate entries in Logstash is that it basically requires all events (at least within a stream) to pass through a single instance, and therefore does not scale well. I think it is generally better to assign a unique id and let Elasticsearch handle it.
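For example, with a fingerprint filter (a minimal sketch; the source fields, hash method, and key are just one choice and should be adapted to whatever defines uniqueness for your data):

```
filter {
  # Derive a deterministic ID from the event content, so identical events
  # always map to the same Elasticsearch document regardless of which
  # Logstash instance processed them.
  fingerprint {
    source => ["message"]
    target => "[@metadata][fingerprint]"
    method => "SHA1"
    key    => "some-static-key"
  }
}

output {
  elasticsearch {
    hosts       => ["localhost:9200"]
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```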

@Christian_Dahlqvist I see what you're saying: to detect duplicates in Logstash properly, you'd have to create some kind of deduplication service and find a way to scale it, which adds more complexity. Hmm. OK, so as you said, if Elasticsearch can handle it, a feature in Elasticsearch to drop the duplicate would be more efficient. I will move my feature request from Logstash to Elasticsearch.

Thanks,
E

@magnusbaeck @Christian_Dahlqvist

@theuntergeek mentioned that this option already exists in the elasticsearch output via the `create` action. See here: https://github.com/logstash-plugins/logstash-output-elasticsearch/issues/574

So if the document ID already exists, the indexing request just fails. That seems more efficient to me than doing the upsert, so I'll go ahead with that option.
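Roughly like this (host and ID field are placeholders):

```
output {
  elasticsearch {
    hosts       => ["localhost:9200"]
    document_id => "%{[my_unique_id]}"
    # "create" rejects the request with a conflict error if a document
    # with this ID already exists, instead of overwriting it as the
    # default "index" action would.
    action      => "create"
  }
}
```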

Thanks for everyone's input.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.