Best practice for legacy address database searches

We have a legacy database of addresses that contains data from multiple countries and is, quite frankly, a mess! We have lots of duplicates, and lots of entries where the same postcode is entered with and without spaces or with other formatting variations. I'm looking for any help with best practice on indexing the data in a way that makes searching it easier and more useful.

I think one of the reasons we have duplicates around the postcode is that when someone searches for A112AB it doesn't match A1 12AB. I'm pretty sure we're not the first people to have this issue, so I was wondering if anyone has a set of best practices for handling legacy data sets like this?

Hi!

There are indeed a few things you could look at.

Firstly, for common misspellings, there's always fuzzy search, which lets you find results even when a word is a few edits away from the indexed term (see this Opster blog for more info on customizing fuzziness as a starting point).
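As a minimal sketch, a match query with fuzziness enabled looks something like this (the `addresses` index and `street` field are just placeholder names standing in for your own mapping):

```json
POST /addresses/_search
{
  "query": {
    "match": {
      "street": {
        "query": "Hihg Street",
        "fuzziness": "AUTO"
      }
    }
  }
}
```

With `"fuzziness": "AUTO"`, Elasticsearch allows up to two edits depending on term length, so the misspelled "Hihg" can still match documents containing "High".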

On the other hand, for your example with the ID-like postcodes - since neither variant is an actual word, fuzzy search might not do much.

I'd recommend taking a look at the various options for cleaning data within the text analyzers - in particular, stemming and tokenizing your free-text data so that similar phrases map to the same concepts, or defining your own synonyms.
The example there is `ipod, i-pod, i pod => ipod, i-pod, i pod`, and you could set up similar rules for some of those postal codes or for the more common, well-known mistakes you've noticed.
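As a sketch of what that could look like, you could combine a `pattern_replace` char filter that strips whitespace from postcodes (so A1 12AB and A112AB index to the same token) with a synonym filter on the free-text fields. The index and field names here (`addresses`, `postcode`, `street`) are placeholders, not anything specific to your setup:

```json
PUT /addresses
{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_spaces": {
          "type": "pattern_replace",
          "pattern": "\\s+",
          "replacement": ""
        }
      },
      "filter": {
        "address_synonyms": {
          "type": "synonym",
          "synonyms": [
            "ipod, i-pod, i pod => ipod, i-pod, i pod"
          ]
        }
      },
      "analyzer": {
        "postcode_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["strip_spaces"],
          "filter": ["lowercase"]
        },
        "address_text": {
          "tokenizer": "standard",
          "filter": ["lowercase", "address_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "postcode": { "type": "text", "analyzer": "postcode_analyzer" },
      "street": { "type": "text", "analyzer": "address_text" }
    }
  }
}
```

Since the same analyzer runs at index time and search time by default, a search for A112AB and a search for A1 12AB both normalize to a112ab and hit the same documents, which should remove one source of those duplicates.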

For the mistakes you haven't found yet (the ones you'd actually need the most help with), this may be more of a data analysis or even ML problem: looking at the value distribution across each data category and finding outliers, clustering similar values, or even some classification. You might find the `more_like_this` query useful for finding more of the "false duplicates" you mentioned.
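As a rough sketch (again with placeholder index and field names), you could point `more_like_this` at a known record and lower the frequency cutoffs, since address fields are short and most terms appear only once per document:

```json
POST /addresses/_search
{
  "query": {
    "more_like_this": {
      "fields": ["street", "city", "postcode"],
      "like": [
        { "_index": "addresses", "_id": "42" }
      ],
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  }
}
```

The defaults (`min_term_freq: 2`, `min_doc_freq: 5`) are tuned for longer text, so dropping both to 1 tends to work better on short fields like these. Running this per record and reviewing the top hits is one way to surface candidate duplicates for manual or scripted merging.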

Hope this helps as a starting point!