We have a legacy database of addresses containing data from multiple countries, and it is, quite frankly, a mess! There are lots of duplicates, lots of entries where the same postcode appears both with and without spaces, and other formatting variations. I'm looking for any best-practice advice on indexing the data in a way that makes searching it easier and more useful.
I think one of the reasons we have duplicates around the postcode is that when someone searches for A112AB it doesn't match A1 12AB. I'm pretty sure we're not the first people to run into this, so I was wondering if anyone has a set of best practices for handling legacy data sets like this?
Firstly, for common misspellings there's always fuzzy search, which lets you find results even when a word is a few edits away from the indexed term (see this Opster blog for more info on customizing these as a starting point).
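For example, a fuzzy query might look like the sketch below; the `addresses` index and `postcode` field are just placeholders for whatever your mapping actually uses:

```
GET /addresses/_search
{
  "query": {
    "fuzzy": {
      "postcode": {
        "value": "A112AB",
        "fuzziness": "AUTO"
      }
    }
  }
}
```

With `AUTO`, the allowed edit distance scales with the length of the term, which is usually a sensible default.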
On the other hand, for your example with the IDs: since neither is an actual word, fuzzy search might not do much.
I'd recommend taking a look at the various options for cleaning data within the text analyzers: either stemming and tokenizing your text data so that similar phrases map to the same concepts, or defining your own synonyms.
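As a rough sketch of the analyzer route, something like the following would index postcode variants with and without spaces to the same token. The index name, analyzer name and field name here are made up, so adjust them to your own mapping:

```
PUT /addresses-cleaned
{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_spaces": {
          "type": "pattern_replace",
          "pattern": "\\s+",
          "replacement": ""
        }
      },
      "analyzer": {
        "postcode_analyzer": {
          "type": "custom",
          "char_filter": ["strip_spaces"],
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "postcode": {
        "type": "text",
        "analyzer": "postcode_analyzer"
      }
    }
  }
}
```

With that analyzer, A112AB and A1 12AB both end up indexed (and searched) as a112ab, so the with/without-spaces duplicates start matching each other.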
The example there was `ipod, i-pod, i pod => ipod, i-pod, i pod`, which you could set up for some of those postal codes or for the more common, well-known mistakes you've noticed.
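A minimal sketch of that synonym approach, reusing the docs example plus one made-up postcode rule (again, all the names are placeholders):

```
PUT /addresses-with-synonyms
{
  "settings": {
    "analysis": {
      "filter": {
        "postcode_synonyms": {
          "type": "synonym",
          "synonyms": [
            "ipod, i-pod, i pod => ipod, i-pod, i pod",
            "a1 12ab => a112ab"
          ]
        }
      },
      "analyzer": {
        "postcode_synonym_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "postcode_synonyms"]
        }
      }
    }
  }
}
```

If you end up with a lot of these rules, the file-based `synonyms_path` variant of the filter keeps the index settings manageable.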
For the mistakes you may not have found yet (the ones you'd actually need the most help with), this may be more of a data-analysis or even ML problem: looking at the value distribution within each data category and finding outliers, clustering similar values, or even some classification. You might also find the more_like_this query useful for finding more of the "false duplicates" you mentioned.
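For example, something along these lines asks for documents that look like one known record; the field names and document ID are placeholders:

```
GET /addresses/_search
{
  "query": {
    "more_like_this": {
      "fields": ["address_line", "postcode"],
      "like": [
        { "_index": "addresses", "_id": "12345" }
      ],
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  }
}
```

Lowering `min_term_freq` and `min_doc_freq` from their defaults tends to help here, since address fields are short and most terms only appear once per document.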