How to reindex distinct data in elasticsearch


(Gayathri TR) #1

Hi Team,

I am able to reindex the data in elasticsearch using:

  • curl -XPOST 'http://localhost:9200/_reindex?pretty' -H 'Content-Type: application/json' -d'
    {

    "source": {
    "index": "old_index"
    },
    "dest": {
    "index": "new_index"
    },
    "script": {
    "inline": "ctx._source.field_new = ctx._source.remove(\"field\")"
    }
    }'

I have many duplicate log entries in my index. I want to reindex while removing the duplicate entries.
Could you please suggest a method?


(Mark Harwood) #2

The scripts that I use for entity-centric indexing [1] sort the content of a source index by a common key and consolidate multiple docs into an update on a single document in the target index. The "pull" from the source index and the "push" to the target index are both done using the respective bulk APIs.
Your use case is slightly different in that you want to insert a single doc into the target index rather than update one, but you should be able to adapt the included Python script with a few changes.
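This isn't the script from [1], just a minimal sketch of the core idea, written against plain Python dicts so it runs without a cluster. In a real pipeline the function would sit between the scroll/scan "pull" from the source index and the bulk "push" into the target index; the field names `host` and `message` are placeholders for whatever identifies a duplicate in your logs.

```python
def dedupe_by_key(docs, key_fields):
    """Keep the first doc seen for each key; later duplicates are dropped.

    `docs` is an iterable of dicts (e.g. the _source of each hit pulled
    from the source index); `key_fields` names the fields that define
    what counts as "the same" log entry.
    """
    seen = set()
    unique = []
    for doc in docs:
        key = tuple(doc.get(f) for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

logs = [
    {"host": "web1", "message": "disk full"},
    {"host": "web1", "message": "disk full"},  # exact duplicate
    {"host": "web2", "message": "disk full"},
]
print(dedupe_by_key(logs, ["host", "message"]))
```

Each surviving doc would then be bulk-indexed into the target index as a plain insert rather than an update.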

[1] http://bit.ly/entcent


(Kimbro Staken) #3

Just a suggestion, come up with a way to create a key from the data in the log entries and then use create requests to save the data into a new index under that key. If there are duplicate entries then only the first will be indexed and the duplicates will be dropped. Unfortunately, I don't know if the reindex api supports this.
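One hedged sketch of the key-derivation step: hash the fields that define "sameness" into a deterministic document ID, then index each doc with create semantics (e.g. `op_type=create`), so the second write of the same ID is rejected by Elasticsearch instead of stored twice. The choice of `@timestamp` and `message` as key fields below is illustrative, not prescriptive.

```python
import hashlib
import json

def doc_id(doc, key_fields):
    """Deterministic _id from the fields that identify a log entry.

    Two identical entries hash to the same ID, so indexing the second
    one with create semantics fails rather than producing a duplicate.
    """
    key = json.dumps([doc.get(f) for f in key_fields], sort_keys=True)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

a = {"@timestamp": "2017-01-01T00:00:00Z", "message": "disk full"}
b = dict(a)  # an exact duplicate of a
print(doc_id(a, ["@timestamp", "message"]))
print(doc_id(b, ["@timestamp", "message"]))
```

Both prints show the same 40-character hex ID, which is what makes the create request for the duplicate fail.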

Kimbro


(Gayathri TR) #4

Is it possible to remove the duplicate log entries from an existing index ?


(Kimbro Staken) #5

There's no easy way to do that. You'd still have to walk the entire index, figure out which entries are duplicates, and then delete them. It's a lot simpler to reindex and drop the duplicates as part of that process.
