Dedup Records According to Percentage

Greetings: I have more than 1M records in ES. Now I want to dedup the data on the basis of a match percentage. For example: "Give me a list of all records whose titles match 90% or more".
Let's take another example: "I need to retrieve all records whose locations match 80% or more".

I tried to dedup records by exact title with the query below, but I need to retrieve them by match percentage.

GET index/_search
{
  "size": 0,
  "aggs": {
    "duplicate": {
      "terms": {
        "script": "doc['product_title'].value",
        "size": 1000
      },
      "aggs": {
        "duplicate list": {
          "top_hits": {}
        }
      }
    }
  }
}

How can I fetch duplicate records where a specific field matches other records by a defined percentage? Any help would be appreciated. Thanks.

What is a 90% match of a title? 90% of the characters are the same?
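To make the "90% of the characters" reading concrete, here is a minimal sketch in Python using the standard library's `difflib.SequenceMatcher`. It runs outside Elasticsearch, the helper name `char_similarity` and the two sample titles are made up for illustration, and this is only one of several possible definitions of a character-level match.

```python
from difflib import SequenceMatcher


def char_similarity(a: str, b: str) -> float:
    """Return a character-level similarity ratio between 0.0 and 1.0.

    The comparison is case-insensitive. The ratio is 2 * M / T, where M is
    the number of matched characters and T is the total length of both strings.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical titles, not taken from any real index.
ratio = char_similarity("Swimming Pools Georgia", "Swimming Pool Georgia")
print(f"{ratio:.2f}")
```

Under this definition a single missing character barely moves the ratio, while reordered words can lower it a lot, so it may or may not be what "90% matched" is meant to express.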

Generally, if you try to express scores in percentage it won't end well and the Lucene docs are pretty explicit about that.

Thanks for the reply, @xeraa. What else can I do to solve this kind of problem?

For starters, what is an 80% match? How do you calculate that with concrete examples?

For example, I have the title "Greenland swimming Pools Georgia" ...
A 100% match is obvious. If I have a record with the title "swimming pools georgia", then I think it is approximately 70% similar to the previous one. I just want to find an algorithm that can help me fetch these similar records with a score.
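One way to pin down a number for the two titles above is word-level Jaccard similarity: shared words divided by all distinct words. The sketch below is plain Python, not an Elasticsearch feature, and the function name `token_jaccard` is hypothetical. It assumes lowercasing and whitespace splitting are good enough as tokenization.

```python
def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase, whitespace-separated words."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty titles are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


score = token_jaccard("Greenland swimming Pools Georgia", "swimming pools georgia")
print(f"{score:.0%}")  # prints "75%": 3 shared words out of 4 distinct words
```

With this definition the pair scores 75%, close to the rough 70% estimate. Comparing every pair this way across 1M records would be expensive, so it is better suited to checking candidates that a search query has already narrowed down.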

Is there any way to find similar records like this in ES?

Sounds like the more like this feature. Does that work for you?
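For reference, a rough sketch of what such a request could look like, written as a Python helper that builds the search body. The helper name `build_mlt_body` is made up, the field `product_title` is taken from the aggregation earlier in the thread, and the parameter values are assumptions. Note that `minimum_should_match` sets how many of the selected terms must match; it is not a similarity percentage on the score.

```python
import json


def build_mlt_body(title: str, field: str = "product_title", min_match: str = "80%") -> dict:
    """Build a search body that uses the more_like_this query on one field."""
    return {
        "query": {
            "more_like_this": {
                "fields": [field],
                "like": title,
                # Titles are short, so lower the default frequency cutoffs;
                # otherwise most terms would be ignored.
                "min_term_freq": 1,
                "min_doc_freq": 1,
                # Share of the selected terms that a hit must contain.
                "minimum_should_match": min_match,
            }
        }
    }


body = build_mlt_body("Greenland swimming Pools Georgia")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent to `GET index/_search`. Each record would need its own query, so for deduplication this is a way to find candidates for one title at a time rather than a single request over the whole index.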

No, @xeraa. The more like this feature doesn't work for me.

Ok, why not? We're not making much progress with that answer alone.
