ES query to check the existence of a document_id?

Nani_20 · May 29, 2020, 4:39am

I'm using a file input in Logstash to read files, but I encounter duplicates from time to time. Currently, I'm using the options below in the ES output plugin to avoid duplicates.

document_id => "%{message}"
action => "create"

But I see these duplicates passing through the filter section, which has a lot of ruby operations. Is there any way to use an ES filter plugin to query efficiently and stop these duplicates before they pass through my ruby operations?

Christian_Dahlqvist · May 29, 2020, 4:45am

What does the rest of the elasticsearch output configuration look like? Can you show us an example of two duplicate events that have been inserted into Elasticsearch, including the _id and _index?

Nani_20 · May 29, 2020, 4:59am

@Christian_Dahlqvist Thanks for your reply. Find below my ES output plugin.

ES output plugin

elasticsearch {
  hosts => ["http://localhost:9200"]
  index => "sample_index"
  #document_id => "%{message}"
  #action => "create"
}

I'm using the Logstash file input to index various log files available in the input location. These log files are generated by log4j on various systems and are fetched to the input location by a cron job. By default, log4j creates a backup of a log file once it reaches a size limit, so Logstash receives duplicate logs from time to time. These log lines generally have a timestamp and other information about the events. I'm using the full message as an ID to avoid duplicates.

Christian_Dahlqvist · May 29, 2020, 5:10am

This does not correspond to what you posted earlier. Using a full message as an id seems like a bad idea as it could contain illegal or unsuitable characters. It is probably better to create and use a hash. I would recommend reading this blog post as well as this one.
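As a minimal sketch of the hash-based approach (not taken from the linked posts), the fingerprint filter can hash the message into a metadata field that the output then uses as the document ID. The field name [@metadata][fingerprint] and the choice of SHA256 are assumptions made for illustration; some older versions of the fingerprint plugin also require a `key` option for the SHA methods.

```
filter {
  # Hash the full message so the ID contains only safe hex characters.
  fingerprint {
    source => "message"
    target => "[@metadata][fingerprint]"   # assumed field name; @metadata is not indexed
    method => "SHA256"                      # assumed method; older plugin versions may need `key` too
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "sample_index"
    # Identical messages produce the same ID, so a repeated line cannot
    # be indexed as a second document.
    document_id => "%{[@metadata][fingerprint]}"
    action => "create"
  }
}
```

With action => "create", a second event with the same ID is rejected by Elasticsearch rather than overwriting the existing document.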

Nani_20 · May 29, 2020, 8:24am

Sorry, my bad. I have corrected the config file now. I think I had copied a different file initially.
Anyway, I would like to use the entire message as an ID. As you suggested, I think it would be ideal to generate a UUID using the fingerprint filter and then use it to avoid duplicates. Am I right here?
But this will still allow duplicates to enter the other filter plugins, right? Is there any way I can drop these duplicates right after the fingerprint filter creates the ID, by checking for its existence in ES, instead of waiting until the ES output plugin? This would save a lot of time and effort.

Please refer to this post for your reference

Christian_Dahlqvist · May 29, 2020, 8:49am

If you associate a UUID with the event you need to do this directly at the source, e.g. in filebeat, for it to work. There is as far as I know no way to eliminate duplicates within the pipeline before the data gets to Elasticsearch.

Nani_20 · May 29, 2020, 8:50am

I'm afraid that won't be possible, as I'm reading the files using the file input. So I have to do this in the ES output plugin itself, right?

Nani_20 · May 29, 2020, 12:36pm

@Christian_Dahlqvist Could you please let me know of any alternative for file input?
Thanks in advance

Badger · May 29, 2020, 1:19pm

What I was suggesting was to generate the document_id, using fingerprint or something, then use an elasticsearch filter to check whether the document exists. If it does, drop the event; otherwise, continue enriching the event and send it to the elasticsearch output.

The question was (or should have been) how to use an elasticsearch filter to check for the existence of a given document_id in an index.
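A rough sketch of that flow, under stated assumptions: the index is sample_index as in the output shown earlier, documents were indexed with the fingerprint as their _id, every indexed document has an @timestamp field, and the metadata field names are made up for illustration. I have not tested this against a live cluster.

```
filter {
  # 1. Compute the ID the document would get.
  fingerprint {
    source => "message"
    target => "[@metadata][fingerprint]"      # assumed field name
    method => "SHA256"                         # older plugin versions may need `key` too
  }

  # 2. Look for an existing document with that _id.
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "sample_index"
    query => "_id:%{[@metadata][fingerprint]}"
    # On a hit, copy a field that every document has into the event,
    # so its presence marks the event as a duplicate.
    fields => { "@timestamp" => "[@metadata][existing_timestamp]" }
  }

  # 3. Drop duplicates before the expensive filters run.
  if [@metadata][existing_timestamp] {
    drop { }
  }

  # ... ruby filters and other enrichment go here ...
}
```

The output should still set document_id to the same fingerprint, since a document indexed a moment ago may not be searchable yet and the lookup can miss it.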

Christian_Dahlqvist · May 29, 2020, 1:29pm

Given that this approach will result in a query per document, it will be very slow and will not scale well. There is also the issue of indexed events not being searchable immediately, but with low throughput that may be less of a problem.

