Using an aggregation, I am able to query out doc_count: 272152 duplicate instances in my Elasticsearch database.
The problem now is that if I simply run a _delete_by_query, it will delete everything, including the originals.
What effective strategy can I use to retain the original document in each group of duplicates?
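Roughly, the aggregation I am referring to looks like this (a sketch only; duplicate_key is a placeholder for whatever field the duplicates share):

POST testindex/_search
{
  "size": 0,
  "aggs": {
    "duplicates": {
      "terms": {
        "field": "duplicate_key",
        "min_doc_count": 2,
        "size": 10000
      }
    }
  }
}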
From what I've read online, one possible solution is to first run a request with a min aggregation to get the minimum value of the timestamp field, and then exclude that value in the search body of the delete request.
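As a sketch of that idea (the field names are assumptions: created_date for the timestamp and duplicate_key for the shared field), the two requests would look something like:

POST testindex/_search
{
  "size": 0,
  "query": { "term": { "duplicate_key": "some-duplicated-value" } },
  "aggs": {
    "oldest": { "min": { "field": "created_date" } }
  }
}

POST testindex/_delete_by_query
{
  "query": {
    "bool": {
      "must": [
        { "term": { "duplicate_key": "some-duplicated-value" } }
      ],
      "must_not": [
        { "term": { "created_date": "<min value returned by the first request>" } }
      ]
    }
  }
}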
Thank you for the idea. However, as mentioned, there are at least 200k instances, which means I would have to manually eyeball 200k IDs for that method to work (it took me 2 hours to sift through 100 transactions).
Is there a way to quickly extract all the 'wrong' IDs?
When you say extract them all to a spreadsheet, how do I do that?
From what I know, the Elasticsearch plugin's structured query can download CSV files, but it does not allow the aggregation query itself to be executed.
If you don't mind, could you advise how, from my existing query, I can effectively extract the data to an Excel spreadsheet?
I tried running:

POST testindex/_sql?format=csv
{"query":"SELECT * FROM testindex WHERE created_date < '2020-12-16'"}
And it threw this error:
* "type": "illegal_argument_exception",
* "reason": "Rejecting mapping update to [testindex] as the final mapping would have more than 1 type: [_doc, _sql]"
Then I changed the script to:
POST testindex/_doc?format=csv
{"query":"SELECT * FROM testindex WHERE created_date < '2020-12-16'"}