ES query to check the existence of a document_id?

I'm using a file input in Logstash to ingest files, but I encounter duplicates from time to time. Currently, I'm using the below options in the ES output plugin to avoid duplicates.

document_id => "%{message}"
action => "create"

But I see these duplicates passing through the filter plugins, which include a lot of ruby operations. Is there any way to query ES efficiently from a filter plugin and stop these duplicates before they reach my ruby operations?

What does the rest of the elasticsearch output configuration look like? Can you show us an example of two duplicate events that have been inserted into Elasticsearch, including the _id and _index?

@Christian_Dahlqvist Thanks for your reply. Find below my ES output plugin.

ES output plugin

		output {
			elasticsearch {
				hosts => ["http://localhost:9200"]
				index => "sample_index"
				#document_id => "%{message}"
				#action => "create"
			}
		}

I'm using the Logstash file input to index various log files available in the input location. These log files are generated by log4j on various systems and are fetched by a cron job into the input location. By default, log4j creates a backup of a log file after a size limit is reached, so Logstash receives duplicate logs from time to time. These log lines generally contain a timestamp and other information about the events. I'm using the full message as an ID to avoid duplicates.

This does not correspond to what you posted earlier. Using the full message as an ID seems like a bad idea, as it could contain illegal or unsuitable characters. It is probably better to create and use a hash instead. I would recommend reading this blog post as well as this one.

Sorry, my bad, I have corrected the config now. I think I had a different file copied initially.
Anyway, I would like to use the entire message as the basis for the ID. As you suggested, I think it would be ideal to generate a UUID using the fingerprint filter and then use it to avoid duplicates. Am I right here?
But this will still allow duplicates to enter the other filter plugins, right? Is there any way I can drop these duplicates right after the fingerprint filter creates the ID, by checking its existence in ES instead of waiting until the ES output plugin? This would save a lot of time and effort.
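To make it concrete, this is roughly what I have in mind for the ID generation (the SHA256 method and the @metadata field name are just my guesses, not something already in my config):

		filter {
			fingerprint {
				source => "message"
				target => "[@metadata][fingerprint]"
				method => "SHA256"
			}
		}
		output {
			elasticsearch {
				hosts => ["http://localhost:9200"]
				index => "sample_index"
				document_id => "%{[@metadata][fingerprint]}"
				action => "create"
			}
		}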

Please refer to this post for reference.

If you associate a UUID with the event, you need to do this directly at the source, e.g. in Filebeat, for it to work. There is, as far as I know, no way to eliminate duplicates within the pipeline before the data gets to Elasticsearch.

I'm afraid that won't be possible, as I'm reading the files using the file input. So I have to do this in the ES output plugin itself, right?

@Christian_Dahlqvist Could you please let me know of any alternative for file input?
Thanks in advance

What I was suggesting was to generate the document_id, using fingerprint or something, then use an elasticsearch filter to check whether the document exists. If it does, drop the event, otherwise, continue enriching the event and send it to the elasticsearch output.
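For example, something along these lines. This is an untested sketch; the query syntax, the SHA256 method, and the field names are assumptions, and you should verify how the elasticsearch filter behaves when the query returns no hits:

		filter {
			fingerprint {
				source => "message"
				target => "[@metadata][fingerprint]"
				method => "SHA256"
			}
			# Look up the fingerprint in the index; if a matching document
			# exists, copy one of its fields onto the event as a marker.
			elasticsearch {
				hosts => ["http://localhost:9200"]
				index => "sample_index"
				query => "_id:%{[@metadata][fingerprint]}"
				fields => { "message" => "[@metadata][duplicate]" }
			}
			# The marker is only set when a hit was found, so drop duplicates.
			if [@metadata][duplicate] {
				drop { }
			}
		}

The same fingerprint would then be used as the document_id in the elasticsearch output.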

The question was (or should have been) how to use an elasticsearch filter to check for the existence of a given document_id in an index.

Given that this approach results in a query per document, it will be very slow and will not scale well. There is also the issue of indexed events not being searchable immediately, although with low throughput that may be less of a problem.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.