How can I delete all the duplicate records except one to keep my data?

Hello,
I'm new to Elasticsearch. The project I'm showing here is a study project. I could use your help to deduplicate records that were registered 10 times each over half a day.

I tried different things that didn't work (a Python script, an Elasticsearch query, ...).

Here is a screenshot, which should make my problem clearer:

In your opinion, how can I delete all the duplicate records except one, so that I keep my data?

Many thanks
Thierry

Hello Thierry, and welcome! :wink:

Instead, I'd probably reindex the whole dataset into a new index and use the hash as the document id, if possible.

You can use the reindex API to read the data from the source index and send the documents to the destination index.
Adding an ingest pipeline that sets the id of each document is the way to do it.

Then I'd simply drop the old index.
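
The reason this deduplicates: an index holds exactly one document per _id, so indexing a second copy with the same _id simply overwrites the first. A quick illustration (not from the thread; the throwaway index dedupe_test and the hash value are made up):

curl -XPUT -H "Content-Type: application/json" "http://10.10.10.21:9200/dedupe_test/_doc/abc123?pretty" -d '{"hash": "abc123", "value": 1}'
curl -XPUT -H "Content-Type: application/json" "http://10.10.10.21:9200/dedupe_test/_doc/abc123?pretty&refresh=true" -d '{"hash": "abc123", "value": 2}'
# The second call reports "result" : "updated" and "_version" : 2:
# the same document was overwritten, not duplicated.
curl "http://10.10.10.21:9200/dedupe_test/_count?pretty"
# => "count" : 1 — only one document remains for that hash.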

I had tried a very simple reindex before, but it gave an error, so I thought I was on the wrong track and gave up...

[root@node21-elasticsearch-kibana es-dedupe]# curl -XPOST -H "Content-Type: application/json" http://10.10.10.21:9200/blockchain/_r eindex?pretty -d '
> {
>     "source":
> {
>       "index": "blockchain"
>     },
>     "dest":
> {
>       "index": "blockchain3"
>     }
> }'

Here is the result:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Rejecting mapping update to [blockchain] as the final mapping would have more than 1 type: [_doc, _reindex]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "Rejecting mapping update to [blockchain] as the final mapping would have more than 1 type: [_doc, _reindex]"
  },
  "status" : 400
}

Try with this instead:

http://10.10.10.21:9200/_reindex?pretty
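
That is, the same request body as before, just sent to the root _reindex endpoint without the index name in the path:

curl -XPOST -H "Content-Type: application/json" "http://10.10.10.21:9200/_reindex?pretty" -d '
{
  "source": {
    "index": "blockchain"
  },
  "dest": {
    "index": "blockchain3"
  }
}'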

Thanks for your answers, David!
I still get this error (the "r eindex" split was just a copy-paste error).

It works through Kibana in Dev Tools, though.

Now, how can I pass the hash as _id, like you said?

Have a look at https://www.elastic.co/guide/en/elasticsearch/reference/7.5/set-processor.html
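
If you want to check what such a processor does before running the reindex, the simulate API lets you run a pipeline definition against a sample document (an aside, not from the thread; the hash value here is made up):

curl -XPOST -H "Content-Type: application/json" "http://10.10.10.21:9200/_ingest/pipeline/_simulate?pretty" -d '
{
  "pipeline": {
    "processors": [
      { "set": { "field": "_id", "value": "{{hash}}" } }
    ]
  },
  "docs": [
    { "_source": { "hash": "abc123" } }
  ]
}'
# The response should show "_id" : "abc123" on the simulated document.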

Hi,

The error you got from the command line is because you have the index name in the request path: /blockchain/_reindex.
It should just be the /_reindex API.

Thanks, David and Veera!

I've created this script with the reindex and the set processor, but I still don't get my hash in _id. Do you know what could be wrong in my script?

#!/bin/bash

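# Recreate the destination index blockchain3 from the saved mapping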
curl -X DELETE http://10.10.10.21:9200/blockchain3?pretty && \
curl -XPUT -H "Content-Type: application/json" http://10.10.10.21:9200/blockchain3 -d @/apps/code/elasticsearch/blockchain/mapping.json

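# Create an ingest pipeline that copies the hash field into the document _id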
curl -XPUT -H "Content-Type: application/json" http://10.10.10.21:9200/_ingest/pipeline/set_id?pretty -d '
{
  "description": "sets the value of _id from the field hash",
  "processors": [
    {
      "set": {
        "field": "_id",
        "value": "{{hash}}"
      }
    }
  ]
}'

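# Reindex blockchain into blockchain3 (note: the pipeline is not referenced here)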
curl -XPOST -H "Content-Type: application/json" http://10.10.10.21:9200/_reindex?pretty -d '
{
  "source": {
    "index": "blockchain"
  },
  "dest": {
    "index": "blockchain3"
  }
}'

Please don't post images of text as they are hard to read, may not display correctly for everyone, and are not searchable.

Instead, paste the text and format it with the </> icon or pairs of triple backticks (```), and check the preview window to make sure it's properly formatted before posting. This makes it more likely that your question will receive a useful answer.

It would be great if you could update your post to solve this.

Done! :slight_smile:

The reindex doc says:

Reindex can also use the Ingest node feature by specifying a pipeline...

So you should try:

curl -XPOST -H "Content-Type: application/json" http://10.10.10.21:9200/_reindex?pretty -d '
{
  "source": {
    "index": "blockchain"
  },
  "dest": {
    "index": "blockchain3",
    "pipeline": "set_id"
  }
}'
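
Once that reindex finishes, one way to double-check that no duplicates are left (a sketch, not from the thread, assuming hash is mapped as a keyword; otherwise aggregate on hash.keyword) is a terms aggregation that only returns values occurring more than once:

curl -XPOST -H "Content-Type: application/json" "http://10.10.10.21:9200/blockchain3/_search?pretty" -d '
{
  "size": 0,
  "aggs": {
    "duplicate_hashes": {
      "terms": {
        "field": "hash",
        "min_doc_count": 2,
        "size": 10
      }
    }
  }
}'
# An empty "buckets" array means every hash now appears exactly once.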

Yay!!
My data is reindexed, deduplicated, and each document has a unique _id!!
Many thanks :smiley:
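
(For completeness, the last step from the suggestion earlier in the thread would be to drop the old index once blockchain3 has been verified, e.g. curl -XDELETE "http://10.10.10.21:9200/blockchain?pretty".)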

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.