Hi Michael
As a user types information about a member, I need to provide a list of
suggestions based on the following: member first/last name, member id,
and date of birth.
It is my understanding that I need to index my membership info (~100M
records) so that I can use edit distance and metaphone against first
name/last name. I would also like to use a list of synonyms for first
names ('Bob' vs 'Robert').
Yes - preparing your data correctly is essential.
It also makes sense to use edit
distance against member id - to account for typing errors.
Do you really think this is the case? If somebody is typing a member ID,
then I think they should see JUST the associated user. Otherwise, if
all I have is the member ID, I type that, and it shows me 20 different
users, how do I know which is the one I want? I'll ignore this
requirement.
Date of birth I would like to use as a filter - if it is there I only
show members matching it.
Also to further narrow the list I would only include members who live
within certain distance of the service location.
I apologise in advance - this email is long, but it is well worth reading
(and I should probably turn it into a tutorial, as this question is
asked often):
OK, so there are two phases here:
- preparing your data, and
- searching
PREPARING YOUR DATA:
First, let's decide how each field needs to be indexed, then we can look
at what analyzers we need to provide.
- first name / last name
  - full word matching
  - partial word matching (for auto-complete, including synonyms)
  - we want to include metaphones
- member id
  - you didn't specify whether this is numeric or alphanumeric, so I'll
    just assume alphanumeric, possibly with punctuation, eg
    "ABC-1234"
  - let's say that we want to tokenize this as "abc","1234", so we'll
    use the "standard" analyzer (the "simple" analyzer keeps only
    letters, so it would drop the "1234")
- birthday
  - this is just a date field, no analysis needed
- location
  - this is a geo_point field, no analysis needed
So we have a list of custom analyzers we need to define:
- full_name:
  - standard tokenizer
  - lowercase
  - ascii folding
- partial_name:
  - standard tokenizer
  - lowercase
  - ascii folding
  - synonyms
  - edge ngrams
- name_metaphone:
  - standard tokenizer
  - phonetic/metaphone filter
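To make the analyzer list above concrete, here is a sketch of the analysis settings as a Python dict, ready to serialize as JSON. The filter names ("name_synonyms", "name_ngrams") and the synonym entries are my own illustrative choices; the actual gist may differ in detail, and the metaphone filter requires the phonetic plugin to be installed:

```python
# Sketch of the custom analyzers described above (names are assumptions).
settings = {
    "analysis": {
        "filter": {
            "name_synonyms": {                # hypothetical synonym list
                "type": "synonym",
                "synonyms": ["bob, robert", "bill, william"],
            },
            "name_ngrams": {
                "type": "edge_ngram",         # front edge n-grams: c, cl, cli...
                "min_gram": 1,
                "max_gram": 20,
                "side": "front",
            },
            "name_metaphone": {
                "type": "phonetic",           # needs the phonetic plugin
                "encoder": "metaphone",
                "replace": False,
            },
        },
        "analyzer": {
            "full_name": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "asciifolding"],
            },
            "partial_name": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "asciifolding",
                           "name_synonyms", "name_ngrams"],
            },
            "name_metaphone": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["name_metaphone"],
            },
        },
    }
}
```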
Here is the command to create the index with the above analyzers and
mapping: Create index for partial matching of names in ElasticSearch · GitHub
It is quite long, but just goes through the process listed above. If you
look at each block, it's actually quite simple.
NOTES:
- For first_name/last_name, I am using multi-fields. The "main"
  sub-field has the same name as the top level, so that if I refer
  to "first_name" it automatically references "first_name.first_name".
  So in effect, I have "first_name", "first_name.partial" and
  "first_name.metaphone"
- In the partial name fields, I am using index_analyzer and
search_analyzer.
Normally, you want your data and search terms to use the same
analyzer - this ensures that you are searching for the same
terms that are actually stored in ES. For example, in the
first_name.metaphone field, I just specify an 'analyzer'
(which sets both the search_analyzer and index_analyzer to
the same value)
However, for the partial field, we want them to be different. If we
store the name "Clinton", we want to be able to use auto-complete
for search terms like 'clin' (ie partial matches). So at index time,
we tokenize clinton as c,cl,cli,clin,clint,clinto,clinton
However, when we search, we don't want 'clin' to match 'cat','cliff'
etc. So we DON'T want to use the ngram tokenizer on search terms.
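The asymmetry between index-time and search-time analysis can be simulated in a few lines of Python (a simplification of what Lucene does, but the matching rule is the same):

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Index-time analysis: emit front edge n-grams of a token,
    e.g. "clinton" -> c, cl, cli, clin, clint, clinto, clinton."""
    token = token.lower()
    return {token[:i] for i in range(min_gram, min(len(token), max_gram) + 1)}

def partial_match(search_term, indexed_token):
    """Search-time: the search term is NOT n-grammed, only lowercased,
    so it must equal one of the stored n-grams exactly."""
    return search_term.lower() in edge_ngrams(indexed_token)

# If we n-grammed the search term too, "clin" would produce the gram
# "c", which matches the "c" stored for "cat" - exactly the bad
# behaviour described above.
```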
So, run the commands in the gist above, and then you can experiment with
searches.
You can see what tokens each analyzer produces with the analyze API.
Try these queries:
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=rob&analyzer=full_name'
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=rob&analyzer=partial_name'
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=rob&analyzer=name_metaphone'
and to check that the ascii folding is working, try 'sánchez' (but URL
encoded):
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=sánchez&analyzer=partial_name'
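If you want to check the folding and the URL encoding without a running cluster, both can be approximated in Python's standard library (NFKD decomposition plus stripping combining marks is roughly what the asciifolding filter does for Latin characters, though the real filter covers more cases):

```python
import unicodedata
from urllib.parse import quote

def ascii_fold(text):
    # Approximates the asciifolding filter: decompose accented
    # characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

folded = ascii_fold("sánchez")       # "sanchez"
encoded = quote("sánchez")           # URL-encoded form for the curl command
```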
SEARCHING YOUR DATA
Let's get rid of the easy stuff first:
birthday:
If your user enters a birthday, then you want to filter the
results to only include members with a matching birthday:
{ term: { birthday: '1970-10-24' }}
location:
Use a geo_distance filter to find results within 100km of
London:
{ geo_distance: {
    distance: "100km",
    location: [-0.12574, 51.50853]   # note: arrays are [lon, lat] order
}}
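Under the hood this is a great-circle distance check, which you can sketch with the haversine formula (a simplification: ES offers several distance algorithms, but the filtering logic is the same):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in km, roughly what a geo_distance
    # filter computes per document.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))   # 6371 km = mean Earth radius

def within(lat1, lon1, lat2, lon2, km=100):
    return haversine_km(lat1, lon1, lat2, lon2) <= km
```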
member_id:
This can use a simple text query:
{ text: { member_id: "abc-1234" }}
OK - now the more interesting stuff: first name and last name.
The logic we want to use here is:
Show me any name whose first or last name field matches
completely or partially, but consider full word matches
to be more relevant than partial matches or metaphone
matches
We're going to combine these queries using the 'bool' query. The
difference between the 'bool' query and the 'dismax' query is that the
'bool' query combines the _score/relevance of each matching clause,
while the 'dismax' query chooses the highest _score from the matching
clauses.
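The difference between the two combining rules can be sketched in a few lines (simplified: real Lucene scoring also normalizes, but the combination logic is the point here):

```python
def bool_should_score(clause_scores):
    # 'bool' query: the scores of all matching should-clauses are summed,
    # so matching more clauses ranks a document higher.
    return sum(clause_scores)

def dis_max_score(clause_scores, tie_breaker=0.0):
    # 'dis_max' query: the best clause wins; other matching clauses only
    # contribute tie_breaker * score.
    best = max(clause_scores)
    return best + tie_breaker * (sum(clause_scores) - best)
```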
{ bool:
  { should: [
      { text: { "first_name": "rob" }},          # full name
      { text: { "first_name.partial": "rob" }},  # partial match
      { text: { "first_name.metaphone": "rob" }} # metaphone
  ]
}}
This will find all docs that match any of the above clauses.
The _score of each matching clause is combined, so a doc which matches
all 3 clauses will rank higher than a doc that matches just one clause,
so we already have some ranking here.
But let's say that we wanted a full word match to be significantly more
relevant than the other two. We can change that clause to:
{ text: { "first_name": {
    query: "rob",
    boost: 2
}}}
(a boost of 1 is the default, so use a value above 1 to promote the
full word match)
Of course, we need to include the same 3 clauses for "last_name" as
well.
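Generating all six clauses by hand gets repetitive, so here is a hypothetical helper (the function name and the default boost are my own; the field names follow the mapping above) that builds the bool query for both name fields:

```python
def name_clauses(term, boost_full=2.0):
    # Build the six should-clauses described above, boosting full-word
    # matches over partial and metaphone matches.
    clauses = []
    for field in ("first_name", "last_name"):
        clauses.append({"text": {field: {"query": term, "boost": boost_full}}})
        clauses.append({"text": {field + ".partial": term}})
        clauses.append({"text": {field + ".metaphone": term}})
    return {"bool": {"should": clauses}}
```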
Now, to turn these into a query that we can pass to the Search API:
All search queries must be wrapped in a top-level { query: {....}}
element, which will contain one of 3 possibilities:
- just the bool query
{ query: { bool: {...}}}
Example name query · GitHub
- just one or more filters
{ query: { constant_score: { filter: {....} }}}
Filter just by name and geo distance · GitHub
- the bool query combined with one or more filters
{ query: { filtered: { query: {bool: ...}, filter: {.....} }}}
Name query filtered by birthday and geo distance · GitHub
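The choice between the three shapes above can be automated with a small helper; this is a sketch assuming the pre-1.0 syntax used in this email (the "filtered" query and the "and" filter):

```python
def build_query(name_query=None, filters=None):
    # Wrap a bool name query and/or a list of filters into one of the
    # three top-level query shapes described above.
    if name_query and filters:
        return {"query": {"filtered": {"query": name_query,
                                       "filter": {"and": filters}}}}
    if filters:
        return {"query": {"constant_score": {"filter": {"and": filters}}}}
    return {"query": name_query}
```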
This was long, but I hope it was worth it.
If anything isn't clear, please ask, and I can improve this and turn it
into a tutorial.
clint