Hey folks,
I am hoping someone can help me figure this issue out. I have been at it for a few days and feel like I have hit a wall, so I'd really appreciate any help. Here is the issue:
I am populating address records into an index. The problem is that the source sends a ton of duplicates that need de-duplication based on address and city (unfortunately there is no PK other than these text fields, so I understand this is not going to be the fastest query, but I am OK with the performance aspect of it for now). The deduplication doesn't need to happen in real time, and we are fine doing it once a day. There are about 16 million documents, and about 800k of them look like duplicates. Here is what I have done so far:
- First and foremost, I ignore any documents where the address or city is null. I also consider only documents that have record_status set to 'active'. See the third step below for more on this flag.
- I find duplicates using the aggregation query shown below.
- Once I get the results back in chunks, I issue an update back to ES that flips record_status to 'inactive' on the duplicate documents, so that they are excluded from the next aggregation call (a rough sketch of this update is shown right after this list).
- For whatever reason, the aggregation doesn't seem to capture all duplicate documents when creating buckets. I can see that a particular address + city combo exists 31 times in the index, but the bucket lists only 6 of them. What's worse, when I include that particular address manually (see the end of this post), it shows all 31 docs. What gives??
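For context, the status flip in the third step looks roughly like this (the index name my-addresses and the _id values are placeholders for the ids collected from each bucket; the real job batches them):

POST /my-addresses/_update_by_query
{
  "query": {
    "ids": {
      "values": ["id-from-bucket-1", "id-from-bucket-2"]
    }
  },
  "script": {
    "lang": "painless",
    "source": "ctx._source.record_status = 'inactive'"
  }
}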
Here's the DSL query I am using:
{
  "size": 0,
  "aggs": {
    "addresses": {
      "terms": {
        "script": {
          "lang": "painless",
          "source": "if (doc['display_address.keyword'].size() != 0 && doc['city.keyword'].size() != 0 && doc['record_status.keyword'].size() != 0 && doc['record_status.keyword'].value == 'active') { return doc['display_address.keyword'].value + doc['city.keyword'].value; } else { return null; }"
        },
        "size": 500,
        "exclude": "null"
      },
      "aggs": {
        "my_filter": {
          "bucket_selector": {
            "buckets_path": {
              "the_doc_count": "_count"
            },
            "script": "params.the_doc_count > 1"
          }
        },
        "final_docs": {
          "top_hits": {
            "sort": [
              {
                "list_date": {
                  "order": "desc"
                }
              }
            ],
            "_source": {
              "includes": [
                "_id",
                "list_date",
                "display_address",
                "city"
              ]
            },
            "size": 100
          }
        }
      }
    }
  }
}
The above query gives me incorrect data for several addresses. For example, 123 Main st, Detroit comes back with only 6 documents.
But if I just add
"include": "123 Main st"
to my terms aggregation, all of a sudden I see all 31 duplicates for 123 Main st, Detroit.
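To be clear, that is the only change; the terms section of the same query becomes:

"terms": {
  "script": { ... same script as above ... },
  "size": 500,
  "include": "123 Main st",
  "exclude": "null"
}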