Hi Michael
As a user types information about a member, I need to provide a list of
suggestions based on the following: member first/last name, member id,
and date of birth.
It is my understanding that I need to index my membership info (~100M
records) so that I can use edit distance and metaphone against first
name/last name. I would also like to use a list of synonyms for first
names ('Bob' vs 'Robert').
Yes - preparing your data correctly is essential.
It also makes sense to use edit
distance against member id - to account for typing errors.
Do you really think this is the case? If somebody is typing a member ID,
then I think they should see JUST the associated user. Otherwise, if
all I have is the member ID, I type that, and it shows me 20 different
users, how do I know which is the one I want? I'll ignore this
requirement.
Date of birth I would like to use as a filter - if it is there I only
show members matching it.
Also to further narrow the list I would only include members who live
within certain distance of the service location.
I apologise in advance - this email is long, but it is well worth reading
(and I should probably turn it into a tutorial, as this question is
asked often):
OK, so there are two phases here:
- preparing your data, and
- searching
PREPARING YOUR DATA:
First, let's decide how each field needs to be indexed, then we can look
at what analyzers we need to provide.
- first name / last name
  - full word matching
  - partial word matching (for auto-complete, including synonyms)
  - we want to include metaphones
- member id
  - you didn't specify whether this is numeric or alphanumeric, so I'll
    just assume alphanumeric, possibly with punctuation, eg
    "ABC-1234"
  - let's say that we want to tokenize this as "abc","1234", so we'll
    use the "standard" analyzer (the "simple" analyzer keeps only
    letters, so it would drop the "1234")
- birthday
  - this is just a date field, no analysis needed
- location
  - this is a geo_point field, no analysis needed
So we have a list of custom analyzers we need to define:
- full_name:
  - standard tokenizer
  - lowercase
  - ascii folding
- partial_name:
  - standard tokenizer
  - lowercase
  - ascii folding
  - synonyms
  - edge ngrams
- name_metaphone:
  - standard tokenizer
  - phonetic/metaphone filter
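To make the analyzer list above concrete, here is a sketch of the analysis settings as a Python dict, ready to serialize as JSON. The filter names ("name_synonyms", "name_ngrams") and the synonym entries are my own illustrative choices; the actual gist may differ in detail, and the metaphone filter requires the phonetic plugin to be installed:

```python
# Sketch of the custom analyzers described above (names are assumptions).
settings = {
    "analysis": {
        "filter": {
            "name_synonyms": {                # hypothetical synonym list
                "type": "synonym",
                "synonyms": ["bob, robert", "bill, william"],
            },
            "name_ngrams": {
                "type": "edge_ngram",         # front edge n-grams: c, cl, cli...
                "min_gram": 1,
                "max_gram": 20,
                "side": "front",
            },
            "name_metaphone": {
                "type": "phonetic",           # needs the phonetic plugin
                "encoder": "metaphone",
                "replace": False,
            },
        },
        "analyzer": {
            "full_name": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "asciifolding"],
            },
            "partial_name": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "asciifolding",
                           "name_synonyms", "name_ngrams"],
            },
            "name_metaphone": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["name_metaphone"],
            },
        },
    }
}
```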
Here is the command to create the index with the above analyzers and
mapping: Create index for partial matching of names in ElasticSearch · GitHub
It is quite long, but just goes through the process listed above. If you
look at each block, it's actually quite simple.
NOTES:
- For first_name/last_name, I am using multi-fields. The "main"
  sub-field has the same name as the top level, so that if I refer
  to "first_name" it automatically references "first_name.first_name".
  So in effect, I have "first_name", "first_name.partial" and
  "first_name.metaphone"
- In the partial name fields, I am using index_analyzer and
search_analyzer.
Normally, you want your data and search terms to use the same
analyzer - this ensures that you are searching for the same
terms that are actually stored in ES. For example, in the
first_name.metaphone field, I just specify an 'analyzer'
(which sets both the search_analyzer and index_analyzer to
the same value)
However, for the partial field, we want them to be different. If we
store the name "Clinton", we want to be able to use auto-complete
for search terms like 'clin' (ie partial matches). So at index time,
we tokenize clinton as c,cl,cli,clin,clint,clinto,clinton
However, when we search, we don't want 'clin' to match 'cat','cliff'
etc. So we DON'T want to use the ngram tokenizer on search terms.
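The asymmetry between index-time and search-time analysis can be simulated in a few lines of Python (a simplification of what Lucene does, but the matching rule is the same):

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Index-time analysis: emit front edge n-grams of a token,
    e.g. "clinton" -> c, cl, cli, clin, clint, clinto, clinton."""
    token = token.lower()
    return {token[:i] for i in range(min_gram, min(len(token), max_gram) + 1)}

def partial_match(search_term, indexed_token):
    """Search-time: the search term is NOT n-grammed, only lowercased,
    so it must equal one of the stored n-grams exactly."""
    return search_term.lower() in edge_ngrams(indexed_token)

# If we n-grammed the search term too, "clin" would produce the gram
# "c", which matches the "c" stored for "cat" - exactly the bad
# behaviour described above.
```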
So, run the commands in the gist above, and then you can experiment with
searches.
You can see what tokens each analyzer produces with the analyze API.
Try these queries:
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=rob&analyzer=full_name'
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=rob&analyzer=partial_name'
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=rob&analyzer=name_metaphone'
and to check that the ascii folding is working, try 'sánchez' (but URL
encoded):
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=sánchez&analyzer=partial_name'
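If you want to check the folding and the URL encoding without a running cluster, both can be approximated in Python's standard library (NFKD decomposition plus stripping combining marks is roughly what the asciifolding filter does for Latin characters, though the real filter covers more cases):

```python
import unicodedata
from urllib.parse import quote

def ascii_fold(text):
    # Approximates the asciifolding filter: decompose accented
    # characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

folded = ascii_fold("sánchez")       # "sanchez"
encoded = quote("sánchez")           # URL-encoded form for the curl command
```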
SEARCHING YOUR DATA
Let's get rid of the easy stuff first:
birthday:
If your user enters a birthday, then you want to filter the
results to only include members with a matching birthday:
{ term: { birthday: '1970-10-24' }}
location:
Use a geo_distance filter to find results within 100km of
London:
{ geo_distance: {
    distance: "100km",
    location: [-0.12574, 51.50853]   # note: arrays are [lon, lat] order
}}
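Under the hood this is a great-circle distance check, which you can sketch with the haversine formula (a simplification: ES offers several distance algorithms, but the filtering logic is the same):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in km, roughly what a geo_distance
    # filter computes per document.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))   # 6371 km = mean Earth radius

def within(lat1, lon1, lat2, lon2, km=100):
    return haversine_km(lat1, lon1, lat2, lon2) <= km
```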
member_id:
This can use a simple text query:
{ text: { member_id: "abc-1234" }}
OK - now the more interesting stuff: first name and last name.
The logic we want to use here is:
Show me any name whose first or last name field matches
completely or partially, but consider full word matches
to be more relevant than partial matches or metaphone
matches
We're going to combine these queries using the 'bool' query. The
difference between the 'bool' query and the 'dismax' query is that the
'bool' query combines the _score/relevance of each matching clause,
while the 'dismax' query chooses the highest _score from the matching
clauses.
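The difference between the two combining rules can be sketched in a few lines (simplified: real Lucene scoring also normalizes, but the combination logic is the point here):

```python
def bool_should_score(clause_scores):
    # 'bool' query: the scores of all matching should-clauses are summed,
    # so matching more clauses ranks a document higher.
    return sum(clause_scores)

def dis_max_score(clause_scores, tie_breaker=0.0):
    # 'dis_max' query: the best clause wins; other matching clauses only
    # contribute tie_breaker * score.
    best = max(clause_scores)
    return best + tie_breaker * (sum(clause_scores) - best)
```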
{ bool:
  { should: [
      { text: { "first_name": "rob" }},          # full name
      { text: { "first_name.partial": "rob" }},  # partial match
      { text: { "first_name.metaphone": "rob" }} # metaphone
  ]
}}
This will find all docs that match any of the above clauses.
The _score of each matching clause is combined, so a doc which matches
all 3 clauses will rank higher than a doc that matches just one clause,
so we already have some ranking here.
But let's say that we wanted a full word match to be significantly more
relevant than the other two. We can change that clause to:
{ text: { "first_name": {
    query: "rob",
    boost: 2
}}}
(a boost of 1 is the default, so use a value above 1 to promote the
full word match)
Of course, we need to include the same 3 clauses for "last_name" as
well.
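Generating all six clauses by hand gets repetitive, so here is a hypothetical helper (the function name and the default boost are my own; the field names follow the mapping above) that builds the bool query for both name fields:

```python
def name_clauses(term, boost_full=2.0):
    # Build the six should-clauses described above, boosting full-word
    # matches over partial and metaphone matches.
    clauses = []
    for field in ("first_name", "last_name"):
        clauses.append({"text": {field: {"query": term, "boost": boost_full}}})
        clauses.append({"text": {field + ".partial": term}})
        clauses.append({"text": {field + ".metaphone": term}})
    return {"bool": {"should": clauses}}
```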
Now, to turn these into a query that we can pass to the Search API:
All search queries must be wrapped in a top-level { query: {....}}
element, which will contain one of 3 possibilities:
- just the bool query
{ query: { bool: {...}}}
Example name query · GitHub
- just one or more filters
{ query: { constant_score: { filter: {....} }}}
Filter just by name and geo distance · GitHub
- the bool query combined with one or more filters
{ query: { filtered: { query: {bool: ...}, filter: {.....} }}}
Name query filtered by birthday and geo distance · GitHub
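The choice between the three shapes above can be automated with a small helper; this is a sketch assuming the pre-1.0 syntax used in this email (the "filtered" query and the "and" filter):

```python
def build_query(name_query=None, filters=None):
    # Wrap a bool name query and/or a list of filters into one of the
    # three top-level query shapes described above.
    if name_query and filters:
        return {"query": {"filtered": {"query": name_query,
                                       "filter": {"and": filters}}}}
    if filters:
        return {"query": {"constant_score": {"filter": {"and": filters}}}}
    return {"query": name_query}
```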
This was long, but I hope it was worth it.
If anything isn't clear, please ask, and I can improve this and turn it
into a tutorial.
clint