Identifying Duplicate Records Based on Multiple Fields in Elasticsearch

I have a dataset in Elasticsearch containing records related to users and their assistants. I need to identify duplicate records for a specific file_id. A record should be considered a duplicate if the fields FirstName, LastName, MailingAddress, and PropertyAddress all match those of a previously uploaded record.

I have attempted the following query:


POST /list/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "terms": { "user_id": [2431, 2556, 2557] } }
      ]
    }
  },
  "aggs": {
    "files": {
      "terms": {
        "field": "file_id",
        "include": ["104242"]
      },
      "aggs": {
        "duplicate_records": {
          "terms": {
            "script": {
              "source": "doc['FirstName.keyword'].value + ',' + doc['LastName.keyword'].value + ',' + doc['MailingAddress.keyword'].value + ',' + doc['PropertyAddress.keyword'].value",
              "lang": "painless"
            },
            "min_doc_count": 2
          }
        }
      }
    }
  }
}

Unfortunately, I'm not getting the desired results. I would appreciate any guidance or suggestions on how to improve the query so that it identifies duplicate records based on the specified fields.

Thank you in advance for your help and insights.

How many indices and primary shards is this data set distributed across? What is the total size of the data set across which you are looking for duplicates?

The dataset is spread across 5 primary shards, and the total size of the dataset for duplicate checking is around 600k records.

The approach you have taken will, as far as I can see, only find duplicates if they reside in the same shard. Your approach should therefore work if you only had a single primary shard. I would recommend using the shrink index API to create a copy of your index with just a single primary shard and trying your approach on that to see if there is any difference.
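For reference, a minimal sketch of what that could look like with the shrink index API. The target index name list_single and the node name node-1 are placeholders; before an index can be shrunk, it must be made read-only and a copy of every shard must be relocated onto a single node:

PUT /list/_settings
{
  "settings": {
    "index.routing.allocation.require._name": "node-1",
    "index.blocks.write": true
  }
}

POST /list/_shrink/list_single
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1,
    "index.routing.allocation.require._name": null,
    "index.blocks.write": null
  }
}

Note that _shrink creates a new index, so the original list index stays untouched until you choose to delete it.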

If this is something you want to be able to do on a regular basis, I think we need to know more about the data and the use case to find a suitable approach.

Are the fields you use for deduplication immutable or can they be updated? If they can be updated, how are you updating them and how frequently are they updated?

What is the total size of the index (just primary shards)? How is this expected to grow in the future?
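For reference, a minimal sketch of how those numbers can be read from the _cat indices API (index name list assumed):

GET _cat/indices/list?v&h=index,pri,rep,docs.count,pri.store.size,store.size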

Thank you for your guidance and advice on this. Much appreciated.

Total size: 7.9 GB
Primaries size: 3.9 GB

The fields we use for deduplication are immutable; we do not update them. However, there are a few other fields that may be updated occasionally.

It is expected to grow steadily as users continue to upload files of records, with uploads currently averaging around 100k records.

Is there an alternative solution that does not require changing the index settings or consolidating the shards into a single primary shard? I was considering modifying the query above if such a solution exists.

3.9 GB is not large for a single shard, so even if the data volume grows a few times over, you should only need a single primary shard, which would solve your problem.

If you feel you still need more than a single primary shard, it would help to know how you are indexing your data into Elasticsearch. What tool are you using?

Is there a risk of data loss, or is it advisable to proceed after creating a backup of the index?

Our Elasticsearch data is visualized and explored using Kibana.

Index settings:

{
  "list" : {
    "settings" : {
      "index" : {
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "version" : {
          "created" : "7090199"
        },
        "provided_name" : "list"
      }
    }
  }
}

Example of one of our field mappings:

"MailingAddress" : {
  "type" : "text",
  "fields" : {
    "keyword" : {
      "type" : "keyword",
      "ignore_above" : 256
    }
  }
},

I've created a new index with a single primary shard (roughly as sketched below) and tested the script, but the expected result still hasn't been achieved. I'm sharing the relevant snippets for your review.
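For context, a minimal sketch of how such a single-shard copy can be created with the create index and _reindex APIs; the index name list_new matches the query below, and all other settings are assumed to stay at their defaults:

PUT /list_new
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    }
  }
}

POST /_reindex
{
  "source": { "index": "list" },
  "dest": { "index": "list_new" }
}

The test query and its response follow.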

POST /list_new/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "terms": { "user_id": [2431, 2556, 2557] } }
      ]
    }
  },
  "aggs": {
    "files": {
      "terms": {
        "field": "file_id",
        "include": ["104242"]
      },
      "aggs": {
        "duplicate_records": {
          "terms": {
            "script": {
              "source": "doc['FirstName.keyword'].value + ',' + doc['LastName.keyword'].value + ',' + doc['MailingAddress.keyword'].value + ',' + doc['PropertyAddress.keyword'].value",
              "lang": "painless"
            },
            "min_doc_count": 2
          }
        }
      }
    }
  }
}
The response:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "files" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : 104242,
          "doc_count" : 500,
          "duplicate_records" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 475,
            "buckets" : [ ]
          }
        }
      ]
    }
  }
}
