Help with aggregation to identify dups

LucasV · January 26, 2019, 9:40pm

Hey folks,

I am hoping someone could help me figure this issue out. Been at it for a few days and feel like i have hit a wall. I'd really appreciate any help on this topic. Here is the issue -

I am populating address records into an index. The problem is that the source sends a ton of duplicates that need de-duplication based on address and city (unfortunately there is no PK other than these text fields so i understand this is not going to be the fastest query, but i am ok with the performance aspect of it for now) This deduplication doesn't need to happen in real time and we are ok doing it once a day. There are about 16 million documents and about 800k of them look like duplicates. Here is what i have done so far -

First and foremost, i ignore any rows where the address or city is null. I also consider only those documents that have record_status set to 'active'. See step #3 for more on this flag.
Find duplicates using the following aggregation query
Once i get the results back in chunks, i issue an update query back to ES to flip an attribute called record_status = 'inactive' so that i exclude them in the next aggregation call.
For whatever reason, it doesn't seem to be capturing all duplicate documents when creating buckets. I can see that a particular address + city combo exists 31 times in the index but the bucket only lists 6 of them. What's worse, when i include that particular address manually, it shows all 31 docs. What gives??

Here's the DSL query i am using

{
  "size": 0,
  "aggs": {
"addresses": {
  "terms": {
    "script" : {
        "lang": "painless",
        "source": "if ((doc['display_address.keyword'].value != null && doc['city.keyword'].value != null) && doc['record_status.keyword'].value == 'active') {return doc['display_address.keyword'].value + doc['city.keyword'].value;} else {return null;}"
  },
  "size": 500,
  "exclude": "null"
},
"aggs": {
  "my_filter": {
    "bucket_selector": {
      "buckets_path": {
        "the_doc_count": "_count"
      },
      "script": "params.the_doc_count > 1"
    }
  },
  "final_docs": {
    "top_hits": {
      "sort": [
        {
          "list_date": {
            "order": "desc"
          }
        }
      ],
      "_source": {
        "includes": [
          "_id",
          "list_date",
          "display_address",
          "city"
        ]
      },
      "size": 100
    }
  }
}
  }
}
}

This above query gives me incorrect data for several addresses. For example, for 123 Main st, Detroit comes back with only 6 documents.

Instead if I just add

"include": "123 Main st"

in my aggregation, all of a sudden i see 31 duplicates for 123 Main st, Detroit

LucasV · January 28, 2019, 5:11pm

Any pointers? anyone?

LucasV · February 4, 2019, 10:18pm

Found the answer myself after struggling with this issue for weeks. Leaving it here hoping it will help someone else.

Turns out that terms aggregation in elastic search is not always a 100% accurate. It is mentioned in the docs. Anyways, if you are looking to get precise results, you have to use composite aggregation. I switched up my query as follows and it seems to be working properly now

GET properties/_search?
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "record_status.keyword": "active"
          }
  },
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 500,
        "sources": [
          {
            "product": {
              "terms": {
                "field": "unique_address_key.keyword"
              }
            }
          }
        ]
      },
      "aggs": {
        "top_sales_hits": {
          "top_hits": {
            "sort": [
              {
                "list_date": {
                  "order": "desc"
                }
              }
            ],
            "_source": {
              "includes": [
                "display_address"
              ]
            },
            "size": 1
          }
        },
        "aggs": {
          "filter": {
            "bucket_selector": {
              "buckets_path": {
                "doc_count": "top_hits.doc_count"
              },
              "script": "_count > 2"
            }
          }
        }
      }
    }
  }
}

system · March 4, 2019, 10:18pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Using aggregation in Elasticsearch to find duplicate data in the same index Elasticsearch painless	6	14124	January 22, 2021
Find Duplicate records in data Elasticsearch	7	18665	July 5, 2017
Deduplication of records with deletion code Elasticsearch	1	1280	April 3, 2018
Aggregation still returning duplicates Elasticsearch	6	3086	February 12, 2018
Duplicates Query not returning all results Elasticsearch	3	1552	July 5, 2017

Help with aggregation to identify dups

Related topics