Find Duplicate records in data


(Sudip) #1

I have database of 100 thousands persons record. I need to find duplicate records using different matching fields.

currently i can figure out duplicate records using dedup query, that is limited to only only field.

POST my_index/my_customer/_search
{
  "aggs":{
    "dedup" : {
      "terms": {
        "field":  "FirstName"
       },
       "aggs":{
         "dedup_docs":{
           "top_hits":{
             "size":1
           }
         }
       }    
    }
  }
}

but when i tried for multiple fields, the results are expanding rather than minimizing

POST my_index/my_customer/_search
{
  "aggs":{
    "dedup" :
     { "terms":  {
      "script": "doc['FirstName'].values  + doc['LastName'].values ",
      "min_doc_count": 2
    },
       "aggs":{
         "dedup_docs":{
           "top_hits":{
             "size":2
           }
         }
       }    
    }
  }
}

i have more fields that is nested inside my_customer, that i need to query for matching duplicate records.

Anybody have similar issue resolved?


(Nik Everett) #2

Two points/questions:

  1. Always wrap your code blocks in ``` so they are easier to read. I edited your post with my god-like powers to add them in so I could read it.
  2. Are FirstName and LastName analyzed? You may want to have a look at what some of your docs come back as when you use that same script as a script_fields. You are likely better off getting the non-analyzed version. Or maybe a version with only light analysis. You can try it using _source['FirstName] instead of doc['FirstName'] to get a non-analyzed version but beware - that is orders of magnitude slower. Like, woah, that is really slow, slower. If you happen to have a .raw style field that is going to be faster.

(Sudip) #3

yes, it is lowercase and keyword custom analyzer.
what is .raw style field?


(Nik Everett) #4

Sometimes you'd use the multi-field pattern to declare a couple of ways to analyze the string, one of which doesn't do any tokenizing. For this use case lowercase and keyword is fine. So you don't need it.

Do the results make sense when you ask for a few documents with your script in a script_fields part of the request? I expect if you are getting funny results then the script_field will show it. Like, for example, does your script actually make a list of results rather than concatenate the strings? In which case you'd want .value instead of .values.


(Sudip) #5

thank you for your time

Yes, i am using in line script, and asking for few documents also make sense, for using script for 'FirstName' only give the first name results back and 'LastName' only gives last name results back but when i combine that as [quote="mr_search, post:1, topic:42893"]
"doc['FirstName'].values + doc['LastName'].values "
[/quote]

it combines the condition and gives matching condition for "Either Or" with out ranking.

I need is result of matching both condition only.


(Nik Everett) #6

Right. You want AND. When you ask for a few documents, do the results come back as "FirstNameLastName" or ["FirstName", "LastName"]. The first one should work, the last one probably won't.


(Sudip) #7

when i ask for few documents, the results come back as ["FirstName", "LastName"]. its coming by alternate matching, FirstName matching, then last Name and agiain firstName then lastName like that.


(system) #8