What is the best way to design for and query similar names and return a single record?
Homer James Simpson vs Homer James Simpson
Homer James Simpson vs Homer J Simpson
Homer James Simpson vs H J Simpson
Orgs / Businesses
Milwaukee Area Technical College vs Milwaukee Area Technical College
Milwaukee Area Technical College vs Milwaukee Area Tech College
Milwaukee Area Technical College vs Milwaukee Area College
The objective is that if any of these similar terms were queried, the same single record would be returned.
Prior similar topic
Similar to this topic, which seems to have gone unanswered since 2014, I am curious what the best way is to define an index mapping to support similar name matches:
This is an extremely broad question, and a full answer would fill a couple of books - I'll try to keep it short at the expense of gross simplification.
First, you have not specified what the query is and what the document is, so it is hard to figure out what you are after. You also mentioned a phrase query, which searches for terms next to each other; that is completely different from a 'similar' query.
So the question is what you want to achieve. Searching only for a partial match, like in your second name search? You may want to take a look at the match query in combination with minimum_should_match. You also may want to take a look at the boolean query to score exact matches higher. Next, you may want to take a look at synonyms to match terms that are semantically similar but written very differently. Then, you may want to take typos into account by applying some fuzziness.
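As a rough illustration of the match, minimum_should_match, boolean query and fuzziness suggestions above, here is a minimal sketch that only builds the request body as a Python dict. The field name `name`, the helper `build_name_query`, the `75%` threshold and the boost of `2.0` are all assumptions made for the example, not recommendations from this thread.

```python
import json


def build_name_query(text, field="name"):
    """Build a search body for a partial, typo-tolerant name match.

    `field`, the 75% threshold and the boost are illustrative assumptions.
    """
    return {
        "query": {
            "bool": {
                # Required: most of the query terms have to match,
                # with small typos tolerated via fuzziness.
                "must": [
                    {
                        "match": {
                            field: {
                                "query": text,
                                "minimum_should_match": "75%",
                                "fuzziness": "AUTO",
                            }
                        }
                    }
                ],
                # Optional: documents containing the exact phrase score higher.
                "should": [
                    {"match_phrase": {field: {"query": text, "boost": 2.0}}}
                ],
            }
        }
    }


body = build_name_query("Milwaukee Area Tech College")
print(json.dumps(body, indent=2))
```

With four query terms, 75% works out to three required terms, so "Milwaukee Area Tech College" can still match a document named "Milwaukee Area Technical College" on the other three words. The resulting dict would be sent as the body of a search request; the right threshold depends on your data.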
Searching for names comes with its own complexities compared to regular full text search. You may want to take a look at the phonetic analysis plugin, which lets you search for similar names using phonetic algorithms.
Long story short: if you are looking for one out-of-the-box query that answers all of these questions, you will not find it. You need to understand the different ways of indexing fields into Elasticsearch, how single terms are stored in the inverted index (think ngrams, edge ngrams, synonyms on top of splitting each value in a field), and how queries affect scoring and the matching of documents.
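To make the point about terms in the inverted index a bit more concrete, here is a small pure-Python imitation of edge ngrams. It is not Elasticsearch's analyzer, just a sketch of the idea; the function names and the `min_gram`/`max_gram` defaults are assumptions for the example.

```python
def edge_ngrams(token, min_gram=1, max_gram=10):
    """Return the prefixes of a token, from min_gram up to max_gram characters."""
    token = token.lower()
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_terms(value, min_gram=1, max_gram=10):
    """Split a field value on whitespace and collect the edge ngrams of each token."""
    terms = set()
    for token in value.split():
        terms.update(edge_ngrams(token, min_gram, max_gram))
    return terms


# Terms that would be stored for the indexed name.
terms = index_terms("Homer James Simpson")

# The query is only lowercased and split, not ngrammed.
query_tokens = [t.lower() for t in "H J Simpson".split()]

print(sorted(edge_ngrams("Homer")))            # ['h', 'ho', 'hom', 'home', 'homer']
print(all(t in terms for t in query_tokens))   # True
```

Because every prefix of "homer" and "james" is stored as its own term, the initials "h" and "j" find the document. The price is a larger index and many more loose matches, which is why this is usually combined with scoring that favours exact matches.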
P.S. Despite its age, I highly recommend reading the Definitive Guide, as it has a whole chapter about scoring, improving relevancy and so forth. See https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html - a book like Relevant Search (published by Manning) might also be interesting for you.
Agree with all Alexander's points here.
I generally see 3 strategies for comparing names:
- How they look (Michael == Micheal)
- How they sound (Mark == Marc)
- How they're used (Mike == Michael)
Generally, Elasticsearch has out-of-the-box support for 1) and 2) with fuzzy matching and phonetic analysis (the latter via the phonetic analysis plugin).
Option 3 requires a form of thesaurus, and while we have support for synonym files, these are not populated with lists of common names.
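One possible way to set this up is sketched below as a Python dict holding index settings and mappings. It assumes the phonetic analysis plugin is installed (the `phonetic` token filter comes from that plugin) and uses the typeless mapping layout of Elasticsearch 7 and later. All filter, analyzer and subfield names, the choice of `double_metaphone`, and the two inline synonym rules are hypothetical; a real deployment would need its own list of name variants.

```python
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # Strategy 2: how names sound (requires the phonetic analysis plugin).
                "name_phonetic": {
                    "type": "phonetic",
                    "encoder": "double_metaphone",
                    "replace": False,
                },
                # Strategy 3: how names are used. These rules are made-up examples.
                "name_synonyms": {
                    "type": "synonym",
                    "synonyms": ["mike, michael", "tech, technical"],
                },
            },
            "analyzer": {
                "name_phonetic_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "name_phonetic"],
                },
                "name_synonym_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "name_synonyms"],
                },
            },
        }
    },
    "mappings": {
        "properties": {
            # The main field keeps the default analysis; fuzzy queries
            # (strategy 1) run against it without extra index-time setup.
            "name": {
                "type": "text",
                "fields": {
                    "phonetic": {"type": "text", "analyzer": "name_phonetic_analyzer"},
                    "synonym": {"type": "text", "analyzer": "name_synonym_analyzer"},
                },
            }
        }
    },
}
```

Indexing the same value into several subfields keeps each strategy separately queryable, at the cost of a larger index.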
It's important to point out that you shouldn't necessarily opt for a single matching strategy - you can use all 3 together as should clauses in a bool query - the more clauses match, the merrier.
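A minimal sketch of that combination, again only building the request body in Python. It assumes a `name` field with two hypothetical subfields, `name.phonetic` and `name.synonym`, that are indexed with a phonetic and a synonym analyzer respectively; those names and the helper `build_combined_name_query` are made up for the example.

```python
import json


def build_combined_name_query(text):
    """Combine the three name-matching strategies as should clauses.

    Field and subfield names are assumptions for illustration.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    # 1) How they look: tolerate small spelling differences.
                    {"match": {"name": {"query": text, "fuzziness": "AUTO"}}},
                    # 2) How they sound: compare phonetic encodings.
                    {"match": {"name.phonetic": {"query": text}}},
                    # 3) How they're used: match through synonym rules.
                    {"match": {"name.synonym": {"query": text}}},
                ],
                # At least one strategy has to match; each extra match adds to the score.
                "minimum_should_match": 1,
            }
        }
    }


combined = build_combined_name_query("Mike Simpson")
print(json.dumps(combined, indent=2))
```

Since the scores of matching should clauses add up, a document that matches on spelling, sound and synonym ranks above one that matches on a single strategy.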
You may also be interested in some commercial offerings specialising in name matching that employ all these techniques and more. Basis have an Elasticsearch plugin, and you can trial their name matching on their website.