Effective Way to Remove Existing Duplicate Documents in ElasticSearch

Hi Everyone,

Using an aggregation, I am able to query out doc_count: 272152 duplicate instances in my Elasticsearch database.

The problem is that if I simply run a _delete_by_query, it will delete everything, including the originals.

What effective strategy can I use to retain the original documents?

From reading online, one possible solution is to first run a request that gets the minimum value of the timestamp field with a min aggregation, and then exclude that value in the search body of the delete.

Please advise.

{
  "size": 0,
  "aggs": {
    "duplicateDocs": {
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "createdDate": {
                  "from": "2020-06-01",
                  "to": null,
                  "include_lower": true,
                  "include_upper": true,
                  "boost": 1
                }
              }
            },
            {
              "bool": {
                "should": [
                  {
                    "terms": {
                      "messageType": [
                        "short",
                        "long",
                        "superlong"
                      ]
                    }
                  },
                  {
                    "prefix": {
                      "messagetype": "veryshort"
                    }
                  }
                ],
                "must_not": [
                  {
                    "terms": {
                      "matchingType": [
                        "six",
                        "one"
                      ]
                    }
                  }
                ]
              }
            }
          ]
        }
      },
      "aggs": {
        "duplicateCount": {
          "terms": {
            "field": "Txn_Ref.keyword",
            "min_doc_count": 2,
            "size": 1000
          }
        }
      }
    }
  }
}
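
For the min-aggregation idea above, something like this might work (a sketch with placeholders: my_index, the Txn_Ref value TXN-0001, and the timestamp below are assumptions, and it presumes the oldest copy per group is the keeper). First, collect the earliest createdDate per duplicate group:

POST my_index/_search
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "Txn_Ref.keyword",
        "min_doc_count": 2,
        "size": 1000
      },
      "aggs": {
        "earliest": {
          "min": { "field": "createdDate" }
        }
      }
    }
  }
}

Then, for each bucket, delete every copy newer than that group's minimum:

POST my_index/_delete_by_query
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "Txn_Ref.keyword": "TXN-0001" } },
        { "range": { "createdDate": { "gt": "2020-06-03T10:15:00.000Z" } } }
      ]
    }
  }
}

One caveat: if two copies share exactly the same createdDate, both survive the range filter, so a residual duplicate check afterwards is still worthwhile.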

You can extract all the IDs and then use the delete-by-query API to delete all those documents.
It is five minutes of work.
The syntax is something like:

POST my_index/_delete_by_query
{
  "query": {
    "terms": {
      "_id": ["id1", "id2", "id9", "id832"]
    }
  }
}
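
The same delete can also be written with the dedicated ids query, which exists for exactly this case (same placeholder IDs as above):

POST my_index/_delete_by_query
{
  "query": {
    "ids": {
      "values": ["id1", "id2", "id9", "id832"]
    }
  }
}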

Hi AClerk,

Thank you for the idea. However, as mentioned, there are at least 200k instances, so I would have to manually eyeball 200k IDs for your method to work (it took me 2 hours to sift through 100 txns).

Is there a way to quickly extract all 'wrong' ids?

Extract them all and filter them in a spreadsheet (like Excel).
Then build your query with the filtered IDs.

I don't know what your data looks like, so it is hard to say exactly how you can filter and extract.
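
One way to pull the IDs out in bulk (a sketch built on your aggregation from earlier in the thread; it assumes sorting createdDate ascending puts the copy you keep first, and the sizes are arbitrary): add a top_hits sub-aggregation so every duplicate bucket lists its document IDs:

POST my_index/_search
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "Txn_Ref.keyword",
        "min_doc_count": 2,
        "size": 1000
      },
      "aggs": {
        "ids": {
          "top_hits": {
            "size": 100,
            "sort": [{ "createdDate": { "order": "asc" } }],
            "_source": false
          }
        }
      }
    }
  }
}

In each bucket, the first hit is the oldest document; every _id after it is a delete candidate, so there is no need to eyeball anything.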

When you say extract them all to a spreadsheet, how do I do that?

From what I know, the Elasticsearch plugin > structured query can download CSV files, but it does not allow the aggregation query itself to be executed.

If you don't mind, could you advise how, from my existing query, I can effectively extract the data to an Excel spreadsheet?

You can use SQL.
It works well in 7.9.1, which I have.

POST /_sql?format=csv
{
  "query": "SELECT filed_1, filed_2, filed_3, filed_N FROM \"my-index\" where ... order by ..."
}

Then just copy&paste into a spreadsheet and process the data there.

Unfortunately, we are currently using version 7.4.2.

Are there any other alternatives, or is it mandatory to do a version upgrade?

And? Is it not supported?

I just tried, and it wasn't working; it threw something like:

{
  "type": "illegal_argument_exception",
  "reason": "Rejecting mapping update to [testindex] as the final mapping would have more than 1 type: [_doc, _sql]"
}

What I did was replace my URL from

testindex/_sql?format=csv

to

testindex/_doc?format=csv

However, nothing happens after that; it says successful, and this was the output:

{
  "_index": "testindex",
  "_type": "_doc",
  "_id": "QsYwanYBg7zkLFbasLND",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 630563,
  "_primary_term": 16
}

Can you provide the full script?

I did this:

POST testindex/_sql?format=csv
{"query":"SELECT * FROM testindex WHERE created_date < '2020-12-16'"}

And it threw this error:

{
  "type": "illegal_argument_exception",
  "reason": "Rejecting mapping update to [testindex] as the final mapping would have more than 1 type: [_doc, _sql]"
}

Then I changed the script to:

POST testindex/_doc?format=csv
{"query":"SELECT * FROM testindex WHERE created_date < '2020-12-16'"}

And finally got this:

{
  "_index": "testindex",
  "_type": "_doc",
  "_id": "QsYwanYBg7zkLFbasLND",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 630563,
  "_primary_term": 16
}

Instead of SELECT *, try SELECT field_1.
If it works, add the other relevant fields.

Also, you should not specify the index in the URL; you say which index it is in the FROM clause. That is exactly why you got the errors above: POST testindex/_sql was interpreted as indexing a document of type _sql into testindex (hence the mapping rejection), and POST testindex/_doc actually indexed your query body as a new document (hence "result": "created"). So change

POST testindex/_sql?format=csv
{
    ....
}

to

POST /_sql?format=csv
{
  "query": "SELECT field_1, field_2, field_3, field_N FROM \"my-index\" where ... order by ..."
}
