Help with aggregation to identify dups

Hey folks,

I am hoping someone could help me figure this issue out. Been at it for a few days and feel like i have hit a wall. I'd really appreciate any help on this topic. Here is the issue -

I am populating address records into an index. The problem is that the source sends a ton of duplicates that need de-duplication based on address and city (unfortunately there is no PK other than these text fields so i understand this is not going to be the fastest query, but i am ok with the performance aspect of it for now) This deduplication doesn't need to happen in real time and we are ok doing it once a day. There are about 16 million documents and about 800k of them look like duplicates. Here is what i have done so far -

  1. First and foremost, i ignore any rows where the address or city is null. I also consider only those documents that have record_status set to 'active'. See step #3 for more on this flag.
  2. Find duplicates using the following aggregation query
  3. Once i get the results back in chunks, i issue an update query back to ES to flip an attribute called record_status = 'inactive' so that i exclude them in the next aggregation call.
  4. For whatever reason, it doesn't seem to be capturing all duplicate documents when creating buckets. I can see that a particular address + city combo exists 31 times in the index but the bucket only lists 6 of them. What's worse, when i include that particular address manually, it shows all 31 docs. What gives??

Here's the DSL query i am using

{
  "size": 0,
  "aggs": {
"addresses": {
  "terms": {
    "script" : {
        "lang": "painless",
        "source": "if ((doc['display_address.keyword'].value != null && doc['city.keyword'].value != null) && doc['record_status.keyword'].value == 'active') {return doc['display_address.keyword'].value + doc['city.keyword'].value;} else {return null;}"
  },
  "size": 500,
  "exclude": "null"
},
"aggs": {
  "my_filter": {
    "bucket_selector": {
      "buckets_path": {
        "the_doc_count": "_count"
      },
      "script": "params.the_doc_count > 1"
    }
  },
  "final_docs": {
    "top_hits": {
      "sort": [
        {
          "list_date": {
            "order": "desc"
          }
        }
      ],
      "_source": {
        "includes": [
          "_id",
          "list_date",
          "display_address",
          "city"
        ]
      },
      "size": 100
    }
  }
}
  }
}
}

This above query gives me incorrect data for several addresses. For example, for 123 Main st, Detroit comes back with only 6 documents.

Instead if I just add

"include": "123 Main st"

in my aggregation, all of a sudden i see 31 duplicates for 123 Main st, Detroit

Any pointers? anyone?

Found the answer myself after struggling with this issue for weeks. Leaving it here hoping it will help someone else.

Turns out that terms aggregation in elastic search is not always a 100% accurate. It is mentioned in the docs. Anyways, if you are looking to get precise results, you have to use composite aggregation. I switched up my query as follows and it seems to be working properly now

GET properties/_search?
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "record_status.keyword": "active"
          }
  },
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 500,
        "sources": [
          {
            "product": {
              "terms": {
                "field": "unique_address_key.keyword"
              }
            }
          }
        ]
      },
      "aggs": {
        "top_sales_hits": {
          "top_hits": {
            "sort": [
              {
                "list_date": {
                  "order": "desc"
                }
              }
            ],
            "_source": {
              "includes": [
                "display_address"
              ]
            },
            "size": 1
          }
        },
        "aggs": {
          "filter": {
            "bucket_selector": {
              "buckets_path": {
                "doc_count": "top_hits.doc_count"
              },
              "script": "_count > 2"
            }
          }
        }
      }
    }
  }
}

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.