Dedup Records According to Percentage

Dawood_Siddiq · August 6, 2019, 1:59pm

Greetings: I have more than 1M records in ES. Now I want to dedup the data on the basis of percentage. For example: "Give me a list of all records whose titles match 90% or more".
Let's take another example: "I need to retrieve all records whose locations match 80% or more".

I tried to dedup records by title, but I need to retrieve them by percentage.

GET index/_search
{
  "size": 0,
  "aggs": {
    "duplicate": {
      "terms": {
        "script": "doc['product_title'].value",
        "size": 1000
      },
      "aggs": {
        "duplicate list": {
          "top_hits": {}
        }
      }
    }
  }
}

How can I fetch duplicate records where a specific column matches other records by a defined percentage? Any help would be appreciated. Thanks.

xeraa · August 6, 2019, 2:56pm

What is a 90% match of a title? 90% of the characters are the same?

Generally, if you try to express scores in percentage it won't end well and the Lucene docs are pretty explicit about that.

Dawood_Siddiq · August 6, 2019, 3:23pm

Thanks for the reply @xeraa. What else can I do to solve this kind of problem?

xeraa · August 6, 2019, 9:42pm

For starters, what is an 80% match? How do you calculate that with concrete examples?

Dawood_Siddiq · August 7, 2019, 9:21am

For example, I have the title "Greenland swimming Pools Georgia" ...
A 100% match is obvious. If I have a record with the title "swimming pools georgia", then I think it is approximately 70% similar to the previous one. I just want to find an algorithm that can help me fetch these similar records with scoring.

Is there any way to find similar records like this in ES?
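
One way to make the "approx 70%" figure in the example above concrete is word-level overlap (Jaccard similarity). The sketch below is only an illustration under that assumption: the function name token_overlap_percent is hypothetical, it runs outside Elasticsearch, and it is not how Elasticsearch computes scores.

```python
def token_overlap_percent(a: str, b: str) -> float:
    """Jaccard similarity of the lowercase word sets, as a percentage."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        # Two empty titles are treated as identical (an assumption).
        return 100.0
    shared = tokens_a & tokens_b
    combined = tokens_a | tokens_b
    return 100.0 * len(shared) / len(combined)


# 3 shared words out of 4 distinct words overall
print(token_overlap_percent("Greenland swimming Pools Georgia",
                            "swimming pools georgia"))  # 75.0
```

With this definition the two titles come out at 75%, close to the estimate above. Other definitions (character edit distance, shingles) would give different numbers, which is why the exact definition matters.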

xeraa · August 7, 2019, 11:46am

Sounds like the more like this feature. Does that work for you?
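
For reference, a minimal sketch of what such a more_like_this request body could look like, built as a Python dict. It assumes product_title is an analyzed text field, and the helper name build_mlt_query is hypothetical. Note that minimum_should_match is the share of selected query terms that must match, not a similarity score in percent.

```python
import json


def build_mlt_query(title: str, min_match: str = "75%") -> dict:
    """Build a more_like_this search body for a single title."""
    return {
        "query": {
            "more_like_this": {
                "fields": ["product_title"],  # assumed field name and mapping
                "like": title,
                # Titles are short, so lower the default frequency cutoffs
                # (defaults are 2 and 5) so that terms are not dropped.
                "min_term_freq": 1,
                "min_doc_freq": 1,
                "minimum_should_match": min_match,
            }
        }
    }


body = build_mlt_query("Greenland swimming Pools Georgia")
print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of GET index/_search. Whether it returns the expected near-duplicates depends on the analyzer and on how the terms are selected, so it would need checking against real data.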

Dawood_Siddiq · August 7, 2019, 11:50am

No @xeraa. The more like this feature doesn't work for me.

xeraa · August 7, 2019, 12:44pm

Ok, why not? We're not making much progress with that answer alone.

