Logstash -> drop duplicate -> elasticsearch

Hello,

Looking over documentation I see that the field "document_id" can be used as a way to upsert in Elasticsearch.
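For reference, this is the kind of upsert configuration I mean (a minimal sketch; the host, index name, and the `my_unique_id` field are placeholders for whatever holds your own unique key):

```
output {
  elasticsearch {
    hosts         => ["localhost:9200"]
    index         => "myindex"
    # Use our own key as the Elasticsearch document ID
    document_id   => "%{[my_unique_id]}"
    # Update the existing document if the ID is already there,
    # otherwise insert it
    action        => "update"
    doc_as_upsert => true
  }
}
```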

However, is there a way to detect the duplicate at the filter stage and just drop it?

Thanks,
E

Theoretically you could probably use the elasticsearch filter but that would be insanely inefficient.
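Roughly like this, if you really wanted to (hosts, index, and field names are placeholders); note that it issues one search request per event, which is where the inefficiency comes from:

```
filter {
  # Look for an existing document with the same ID.
  elasticsearch {
    hosts  => ["localhost:9200"]
    index  => "myindex"
    query  => "_id:%{[my_unique_id]}"
    # Copy a field from the hit, if any, so we can tell there was a match
    # (hypothetical target field name)
    fields => { "@timestamp" => "existing_timestamp" }
  }
  # If the lookup populated the field, a document with this ID already exists.
  if [existing_timestamp] {
    drop { }
  }
}
```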

@magnusbaeck yeah, that would not be good.

How about something like the following:

  1. A local file that tracks all the document IDs, populated at the output stage
  2. At the filter stage you simply check against that file, and if you find a matching document ID you drop the event

The side effect would be: if the elasticsearch output fails but the file output succeeds, the document never makes it into Elasticsearch even though its ID is on file, so you'd believe it succeeded. To resolve this side effect you would need a third step.

  1. If the elasticsearch output fails, catch the failure and remove the document ID from the local file

This would be a pretty efficient solution. Based on your experience, can you recommend the steps necessary to make this possible?

Thanks,
E

  1. A local file that tracks all the document IDs, populated at the output stage
  2. At the filter stage you simply check against that file, and if you find a matching document ID you drop the event

Yes, that approach would work. Write the IDs to a file with a file output and use a translate filter to look up from that file.
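A rough sketch of both halves (untested; the field name, file path, and hosts are placeholders, and translate option names vary between plugin versions):

```
filter {
  # Look the event's ID up in the dictionary file written by the output below.
  translate {
    field            => "[my_unique_id]"
    destination      => "[is_duplicate]"
    dictionary_path  => "/var/lib/logstash/seen-ids.yml"
    fallback         => "no"
    # The dictionary is only reloaded periodically, so very recent IDs
    # may not be visible yet.
    refresh_interval => 60
  }
  if [is_duplicate] != "no" {
    drop { }
  }
}

output {
  # Append each ID as a YAML "key: value" line so translate can read it back.
  file {
    path  => "/var/lib/logstash/seen-ids.yml"
    codec => line { format => "%{[my_unique_id]}: seen" }
  }
  elasticsearch {
    hosts       => ["localhost:9200"]
    document_id => "%{[my_unique_id]}"
  }
}
```

Note the refresh window: events arriving between dictionary reloads won't see each other's IDs, so this only catches duplicates that are at least one refresh interval apart.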

  1. If the elasticsearch output fails, catch the failure and remove the document ID from the local file

Sorry, that's not possible unless you modify the elasticsearch output.

@magnusbaeck thanks for your input.

Regarding the 3rd step, I filed a feature request https://github.com/elastic/logstash/issues/6814

Thanks,
E

The problem with trying to detect duplicate entries in Logstash is that it basically requires all events (at least within a stream) to pass through a single instance, and therefore does not scale well. I think it is generally better to assign a unique id and let Elasticsearch handle it.
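For example, with a fingerprint filter (a minimal sketch; the source fields, hash method, and key are just one choice and should be adapted to whatever defines uniqueness for your data):

```
filter {
  # Derive a deterministic ID from the event content, so identical events
  # always map to the same Elasticsearch document regardless of which
  # Logstash instance processed them.
  fingerprint {
    source => ["message"]
    target => "[@metadata][fingerprint]"
    method => "SHA1"
    key    => "some-static-key"
  }
}

output {
  elasticsearch {
    hosts       => ["localhost:9200"]
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```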

@Christian_Dahlqvist I see what you're saying: to detect duplicates in Logstash properly, you'd have to create some kind of deduplication service and find a way to scale it, which adds more complexity. Hmm. OK, so as you said, if Elasticsearch can handle it, a feature in Elasticsearch to drop the duplicate would be more efficient. I will move my feature request from Logstash to Elasticsearch.

Thanks,
E

@magnusbaeck @Christian_Dahlqvist

@theuntergeek mentioned that this option already exists in the elasticsearch output via the `create` action. See here: https://github.com/logstash-plugins/logstash-output-elasticsearch/issues/574

So if the document ID already exists, the indexing request just fails. That seems more efficient to me than doing the upsert, so I'll go ahead with that option.
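Roughly like this (host and ID field are placeholders):

```
output {
  elasticsearch {
    hosts       => ["localhost:9200"]
    document_id => "%{[my_unique_id]}"
    # "create" rejects the request with a conflict error if a document
    # with this ID already exists, instead of overwriting it as the
    # default "index" action would.
    action      => "create"
  }
}
```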

Thanks for everyone's input.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.