How to identify and remove duplicates in an Elasticsearch index

We are using Elasticsearch 7.11.1. We recently observed an issue; the details are below.

  1. We store data at 15-minute intervals, and we take the timestamp from our input file (e.g. 05:00, 23:15, 20:30, 11:45).
  2. Recently we observed that our input file for 23:15 has 1890 records, but the index has 3533 records.
  3. We now want to delete the 1643 duplicate records from the index without disturbing the 1890 original records.

We need an API query for that.

For example:

Input file:

name   product  sale  id
sai    pen      100   1
kumar  car      30    2
sai    pen      100   1
sai    pen      100   1
ram    bike     288   3
kumar  car      30    2

After deleting duplicates, my index should look like this:

name   product  sale  id
sai    pen      100   1
ram    bike     288   3
kumar  car      30    2

I need help with:

  1. A query to find only the duplicates at 23:15
  2. A query to delete the duplicates

Can you please share the API queries for the above issue?

One way to do this is by concatenating all 4 fields into one field; if the count for a value of that field is > 1, use delete by query to delete the duplicates.
If you have a time field such as timestamp or updated, I believe you can use it to restrict the delete by query to the affected interval.
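
To make the "concatenate and count" idea concrete, here is a rough sketch of a search body that only finds the duplicate groups; it does not delete anything. It assumes the four columns from the example (name, product, sale, id) are indexed with doc values (keyword or numeric) and that the time field is called timestamp; that field name, the helper name build_duplicate_search and the timestamp value format are my assumptions, not something taken from your mapping. It uses a terms aggregation with a Painless script and min_doc_count of 2, so only keys that occur more than once come back.

```python
import json

# Columns taken from the example in the question; adjust to the real mapping.
DEDUP_FIELDS = ["name", "product", "sale", "id"]


def build_duplicate_search(timestamp_value, timestamp_field="timestamp",
                           fields=DEDUP_FIELDS, max_groups=10000):
    """Build a _search body that lists concatenated keys occurring 2+ times.

    timestamp_field is a hypothetical field name; the fields are assumed to
    have doc values so the Painless script can read them.
    """
    # Painless expression joining the field values with a '|' separator.
    source = " + '|' + ".join(f"doc['{f}'].value" for f in fields)
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": {"term": {timestamp_field: timestamp_value}},
        "aggs": {
            "duplicate_groups": {
                "terms": {
                    "script": {"lang": "painless", "source": source},
                    "min_doc_count": 2,  # keep only keys seen more than once
                    "size": max_groups,
                }
            }
        },
    }


if __name__ == "__main__":
    # "23:15" is a placeholder; use the value format stored in your index.
    print(json.dumps(build_duplicate_search("23:15"), indent=2))
```

The printed JSON can be sent as the body of a POST to the index's _search endpoint. Note that a delete by query using the same condition would match every copy of a duplicated record, including the one you want to keep, so this only covers the "find" half of the problem.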

Please help us with the query to delete the duplicate records at the 23:15 timestamp.

Maybe this blog post might be useful? I am not sure there is a way to reliably create a query to use with delete by query to handle this, so the approach described in the blog post may be safer.
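
As an illustration of a client-side alternative to delete by query (a sketch under my own assumptions, not necessarily what the blog post does): fetch the documents for the 23:15 interval, group them by the four columns, keep the first document of each group, and delete the rest by _id through the _bulk API. The helper names ids_to_delete and bulk_delete_body and the index name sales are hypothetical, and the hits are assumed to have the usual _id / _source shape of a search response.

```python
import json


def ids_to_delete(hits, fields=("name", "product", "sale", "id")):
    """Return the _id of every hit whose field values were already seen.

    The first document of each group is kept; later copies are returned.
    """
    seen = set()
    extra = []
    for hit in hits:
        key = tuple(hit["_source"].get(f) for f in fields)
        if key in seen:
            extra.append(hit["_id"])
        else:
            seen.add(key)
    return extra


def bulk_delete_body(index, doc_ids):
    """Build the NDJSON body for the _bulk API with one delete action per id."""
    lines = [json.dumps({"delete": {"_index": index, "_id": i}}) for i in doc_ids]
    # The _bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    # The six rows from the example input file, with made-up document ids.
    rows = [("sai", "pen", 100, 1), ("kumar", "car", 30, 2), ("sai", "pen", 100, 1),
            ("sai", "pen", 100, 1), ("ram", "bike", 288, 3), ("kumar", "car", 30, 2)]
    hits = [{"_id": doc_id,
             "_source": dict(zip(("name", "product", "sale", "id"), row))}
            for doc_id, row in zip("abcdef", rows)]
    print(bulk_delete_body("sales", ids_to_delete(hits)))  # "sales" is a placeholder
```

Because each delete targets a specific _id, exactly one copy per group survives. For 3533 documents you would need to page through the results (for example with the scroll API or search_after) before grouping.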
