Help needed with the query


(mfeingold) #1

Can you guys help me understand how I should approach my problem. Here
is what I need to do:
As a user types information about a member I need to provide a list of
suggestions based on the provided information. The information may or
may not include the following: member first/last name, his member id
and date of birth.

It is my understanding that I need to index my membership info (~100M
records) so that I can use edit distance and metaphone against first
name/ last name. I also would like to use a list of synonyms for the
first name ('Bob' vs 'Robert'). It also makes sense to use edit
distance against member id - to account for typing errors. Date of
birth I would like to use as a filter - if it is there I only show
members matching it.
Also to further narrow the list I would only include members who live
within certain distance of the service location.

All of this is just thinking aloud (in writing?). I am not sure how
this should (or even can) be translated into an ES configuration/query.
Any help?


(Clinton Gormley) #2

Hi Michael

As a user types information about a member I need to provide a list of
suggestions based on the ...: member first/last name, his member id
and date of birth.

It is my understanding that I need to index my membership info (~100M
records) so that I can use edit distance and metaphone against first
name/ last name. I also would like to use a list of synonyms for the
first name ('Bob' vs 'Robert').

Yes - preparing your data correctly is essential.

It also makes sense to use edit
distance against member id - to account for typing errors.

Do you really think this is the case? If somebody is typing a member ID,
then I think they should see JUST the associated user. Otherwise, if
all I have is the member ID, I type that, and it shows me 20 different
users, how do I know which is the one I want? I'll ignore this
requirement.

Date of birth I would like to use as a filter - if it is there I only
show members matching it.

Also to further narrow the list I would only include members who live
within certain distance of the service location.

I apologise in advance - this email is long, but it is well worth reading
(and I should probably turn it into a tutorial, as this question is
asked often):

OK, so there are two phases here:

  1. preparing your data, and
  2. searching

PREPARING YOUR DATA:

First, let's decide how each field needs to be indexed, then we can look
at what analyzers we need to provide.

  • first name / last name:

    • these are string fields

    • we want to use synonyms (eg Robert vs Bob)

      http://www.elasticsearch.org/guide/reference/index-modules/analysis/synonym-tokenfilter.html

    • we want to include metaphones

      http://www.elasticsearch.org/guide/reference/index-modules/analysis/phonetic-tokenfilter.html

    • we want ascii folding

      http://www.elasticsearch.org/guide/reference/index-modules/analysis/asciifolding-tokenfilter.html

    • we want to do 3 types of matches:

      • most relevant: full word matches
      • less relevant: partial word matches (eg with ngrams, synonyms)
      • least relevant: metaphone matches

      so we'll index the names with three versions, as a multi-field

      http://www.elasticsearch.org/guide/reference/mapping/multi-field-type.html

  • member id

    • you didn't specify if this is numeric or alphanumeric, so I'll
      just assume alphanumeric, possibly with punctuation, eg
      "ABC-1234"

    • let's say that we want to tokenize this as "abc","1234", so we'll
      use the "simple" analyzer

      http://www.elasticsearch.org/guide/reference/index-modules/analysis/simple-analyzer.html

  • birthday

    • this is just a date field, no analysis needed

  • location

    • this is a geo_point field, no analysis needed

So we have a list of custom analyzers we need to define:

  • full_name:

    • standard token filter
    • lowercase
    • ascii folding
  • partial_name:

    • standard token filter
    • lowercase
    • ascii folding
    • synonyms
    • edge ngrams
  • name_metaphone:

    • standard token filter
    • phonetic/metaphone filter

Here is the command to create the index with the above analyzers and
mapping: https://gist.github.com/1088986

It is quite long, but just goes through the process listed above. If you
look at each block, it's actually quite simple.
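The gist is authoritative; as a rough sketch, the analysis section it defines looks something like the Python dict below (which you could send with any HTTP client). The filter parameters - the synonym list, the ngram sizes - are illustrative assumptions, not copied from the gist:

```python
# Hypothetical sketch of the settings body from the gist; the synonym
# list and ngram sizes are illustrative assumptions, not actual values.
analysis_settings = {
    "analysis": {
        "filter": {
            "name_synonyms": {"type": "synonym",
                              "synonyms": ["bob, robert"]},  # example entry
            "name_ngrams": {"type": "edgeNGram", "side": "front",
                            "min_gram": 1, "max_gram": 20},
            "name_metaphone": {"type": "phonetic",
                               "encoder": "metaphone", "replace": True},
        },
        "analyzer": {
            # one custom analyzer per bullet in the list above
            "full_name": {"type": "custom", "tokenizer": "standard",
                          "filter": ["standard", "lowercase", "asciifolding"]},
            "partial_name": {"type": "custom", "tokenizer": "standard",
                             "filter": ["standard", "lowercase", "asciifolding",
                                        "name_synonyms", "name_ngrams"]},
            "name_metaphone": {"type": "custom", "tokenizer": "standard",
                               "filter": ["standard", "name_metaphone"]},
        },
    }
}

assert set(analysis_settings["analysis"]["analyzer"]) == \
    {"full_name", "partial_name", "name_metaphone"}
```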

NOTES:

  1. For first_name/last_name, I am using multi-fields. The "main"
    sub-field has the same name as the top level, so that if I refer
    to "first_name" it automatically references "first_name.first_name"

    So in effect, I have "first_name" and "first_name.partial" and
    "first_name.metaphone"

  2. In the partial name fields, I am using index_analyzer and
    search_analyzer.

    Normally, you want your data and search terms to use the same
    analyzer - this ensures that you are searching for the same
    terms that are actually stored in ES. For example, in the
    first_name.metaphone field, I just specify an 'analyzer'
    (which sets both the search_analyzer and index_analyzer to
    the same value)

    However, for the partial field, we want them to be different. If we
    store the name "Clinton", we want to be able to use auto-complete
    for search terms like 'clin' (ie partial matches). So at index time,
    we tokenize clinton as c,cl,cli,clin,clint,clinto,clinton

    However, when we search, we don't want 'clin' to match 'cat','cliff'
    etc. So we DON'T want to use the ngram tokenizer on search terms.
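    The index-time/search-time asymmetry is easy to see with a toy
    sketch of what an edge ngram filter emits (a simplification of the
    real token filter):

```python
def edge_ngrams(term, min_gram=1, max_gram=20):
    """Emit the front edge ngrams an index-time analyzer would store."""
    return [term[:n] for n in range(min_gram, min(len(term), max_gram) + 1)]

# Index time: 'clinton' is stored as all of its prefixes...
print(edge_ngrams("clinton"))  # ['c', 'cl', 'cli', 'clin', 'clint', 'clinto', 'clinton']

# ...so the search term 'clin', left whole, matches 'clinton' but not 'cat':
print("clin" in edge_ngrams("clinton"))  # True
print("clin" in edge_ngrams("cat"))      # False
```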

So, run the commands in the gist above, and then you can experiment with
searches.

You can see what tokens each analyzer produces with the analyze API.
Try these queries:

curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=rob&analyzer=full_name'
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=rob&analyzer=partial_name'
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=rob&analyzer=name_metaphone'

and to check that the ascii folding is working, try 'sánchez' (but URL encoded):

curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=sánchez&analyzer=partial_name'

SEARCHING YOUR DATA

Let's get rid of the easy stuff first:

birthday:

    If your user enters a birthday, then you want to filter the
    results to only include members with a matching birthday:
    
       { term: { birthday: '1970-10-24' }}

location:

    Use a geo_distance filter to find results within 100km of
    London:
    
       { geo_distance: { 
               distance: "100km",
               location: [-0.12574, 51.50853] # note: arrays are [lon, lat]
       }}

member_id:

 This can use a simple text query:

 { text: { member_id: "abc-1234" }}

OK - now the more interesting stuff: first name and last name.

The logic we want to use here is:

Show me any name whose first or last name field matches
completely or partially, but consider full word matches
to be more relevant than partial matches or metaphone
matches

We're going to combine these queries using the 'bool' query. The
difference between the 'bool' query and the 'dismax' query is that the
'bool' query combines the _score/relevance of each matching clause,
while the 'dismax' query chooses the highest _score from the matching
clauses.

{ bool:
{ should: [
{ text: { "first_name": "rob" }},          # full name
{ text: { "first_name.partial": "rob" }},  # partial match
{ text: { "first_name.metaphone": "rob" }} # metaphone
]
}}

This will find all docs that match any of the above clauses.

The _score of each matching clause is combined, so a doc which matches
all 3 clauses will rank higher than a doc that matches just one clause,
so we already have some ranking here.
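The difference is easy to see with made-up per-clause scores (real Lucene scoring is more involved, but the shape is the same):

```python
# Toy per-clause _scores for two docs; the numbers are invented.
clause_scores = {
    "matches_all_3": [1.2, 0.6, 0.4],  # full word, partial, metaphone
    "matches_one":   [1.2],            # full word only
}

bool_score   = {doc: sum(s) for doc, s in clause_scores.items()}  # 'bool' combines
dismax_score = {doc: max(s) for doc, s in clause_scores.items()}  # 'dismax' takes the best

assert bool_score["matches_all_3"] > bool_score["matches_one"]       # bool ranks it higher
assert dismax_score["matches_all_3"] == dismax_score["matches_one"]  # dismax can't tell them apart
```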

But let's say that we want a full word match to be significantly more
relevant than the other two. We can add a boost to that clause:

{ text: { "first_name": {
query: "rob",
boost: 2
}}}

Of course, we need to include the same 3 clauses for "last_name" as
well.

Now, to turn these into a query that we can pass to the Search API:

All search queries must be wrapped in a top-level { query: {....}}
element, which will contain one of 3 possibilities:

  1. just the bool query
    { query: { bool: {...} }}

    https://gist.github.com/1089180

  2. just one or more filters
    { query: { constant_score: { filter: {....} }}}

    https://gist.github.com/1089206

  3. the bool query combined with one or more filters
    { query: { filtered: { query: {bool: ...}, filter: {.....} }}}

    https://gist.github.com/1089201
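For example, here is option 3 assembled as a Python dict (field names from the mapping above; the search term, boost, date and point are illustrative):

```python
# Sketch of option 3: the bool name query wrapped with the birthday and
# geo filters. All the values (term, boost, date, point) are illustrative.
def name_clause(field, term, boost=None):
    body = term if boost is None else {"query": term, "boost": boost}
    return {"text": {field: body}}

query = {
    "query": {
        "filtered": {
            "query": {"bool": {"should": [
                name_clause("first_name", "rob", boost=2),   # full word, boosted
                name_clause("first_name.partial", "rob"),    # partial match
                name_clause("first_name.metaphone", "rob"),  # metaphone
            ]}},
            "filter": {"and": [
                {"term": {"birthday": "1970-10-24"}},
                {"geo_distance": {"distance": "100km",
                                  "location": [-0.12574, 51.50853]}},  # [lon, lat]
            ]},
        }
    }
}

assert query["query"]["filtered"]["query"]["bool"]["should"][0] == \
    {"text": {"first_name": {"query": "rob", "boost": 2}}}
```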

This was long, but I hope it was worth it.

If anything isn't clear, please ask, and I can improve this and turn it
into a tutorial.

clint


(mfeingold) #3

Hi Clinton:

Thanks for a quick and detailed response.

To clarify a few points:

  1. The synonyms - I only need the synonyms on the first name, I
    think; I do not imagine them being of much use for last names. I am
    not sure if dropping synonyms for the last name would have any impact
    in terms of performance or necessary disk space. Based on your
    templates I hope I understand how to do this.
  2. Names - one of the problems I foresee stems from the fact that I
    want a single input string with all the search parameters (except
    geoloc). The problem is telling the first name apart from the last
    name. An additional complication is that a name (both first and
    last) can consist of several words. My hope was that I could build
    the indexes/queries in such a way that I can throw both names at the
    query as a single string, leaving it to ES to figure out which one
    is which.
  3. Edge ngrams - I would like to limit the wildcard search to, let us
    say, the first 6 chars, assuming that anything longer implies an
    exact match. My clumsy experiments with ES made me think that if
    ngrams are in play, they have to run all the way through the max
    length of the field, otherwise the search misses exact matches. I
    was doing something wrong, I hope.
  4. Member ID - it is alphanumeric. I still think that some degree of
    fuzziness can help here. The idea is to let the user type anything
    he knows and provide autosuggest as he types. So if he knows the ID
    - great, it should be an immediate hit, but if he mistyped it and
    also provided a last name - it can still let me make a pretty well
    educated guess. So why not do it? I think, though, that the edit
    distance allowed here should be minimal.
  5. Geo - I am curious about the performance of this. Does it really
    do sqrt of the sum of squares? Does that mean it actually loops
    through all documents to find the matching ones? The reason I am
    asking is that I do not insist on the circle - I can get away with a
    square; I mean, instead of a real geo distance I can use a range on
    both longitude and latitude. Would that be faster?



(Clinton Gormley) #4

Hi Michael

  1. The synonyms - I only need the synonyms on the first name, I
    think; I do not imagine them being of much use for last names. I am
    not sure if dropping synonyms for the last name would have any impact
    in terms of performance or necessary disk space. Based on your
    templates I hope I understand how to do this.

That's fine, just define one custom analyzer with synonyms, and use that
for first names, and another without synonyms, for the last name.

  2. Names - one of the problems I foresee stems from the fact that I
    want a single input string with all the search parameters (except
    geoloc). The problem is telling the first name apart from the last
    name. An additional complication is that a name (both first and
    last) can consist of several words. My hope was that I could build
    the indexes/queries in such a way that I can throw both names at the
    query as a single string, leaving it to ES to figure out which one
    is which.

In my example, "rob smith" looks for "rob OR smith", and you're running
that against first name and last name, so you can use the same search
string. It will find 'rob' in first name and 'smith' in last name.

  3. Edge ngrams - I would like to limit the wildcard search to, let us
    say, the first 6 chars, assuming that anything longer implies an
    exact match. My clumsy experiments with ES made me think that if
    ngrams are in play, they have to run all the way through the max
    length of the field, otherwise the search misses exact matches. I
    was doing something wrong, I hope.

You can limit the ngrams, but there is no real reason to do so. Also,
you have the full word version of the field which it will match against.
In the bool query I'm using 'should' (which is like 'or') so not all
fields need to match.

  4. Member ID - it is alphanumeric. I still think that some degree of
    fuzziness can help here. The idea is to let the user type anything
    he knows and provide autosuggest as he types. So if he knows the ID
    - great, it should be an immediate hit, but if he mistyped it and
    also provided a last name - it can still let me make a pretty well
    educated guess. So why not do it? I think, though, that the edit
    distance allowed here should be minimal.

That's probably fine - with the text query you can use the 'fuzziness'
parameter.
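For intuition: fuzziness is based on edit distance, which counts single-character insertions, deletions and substitutions. A minimal sketch (plain Levenshtein, not ES's actual implementation):

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

# A single mistyped character in a member ID is one edit away, so a
# minimal fuzziness setting would still catch it:
print(edit_distance("abc-1234", "abc-1235"))  # 1
print(edit_distance("abc-1234", "abc-1243"))  # 2 (transposed digits)
```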

  5. Geo - I am curious about the performance of this. Does it really
    do sqrt of the sum of squares? Does that mean it actually loops
    through all documents to find the matching ones? The reason I am
    asking is that I do not insist on the circle - I can get away with a
    square; I mean, instead of a real geo distance I can use a range on
    both longitude and latitude. Would that be faster?

geo_distance is fast. I'm not sure how it works internally, but no
worries there.
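For what it's worth, your square idea is a common implementation strategy anyway: a cheap lat/lon range check as a prefilter, with the exact great-circle distance computed only for the candidates that survive. A rough Python sketch (the degree-to-km constants are approximate):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Exact great-circle distance in km."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    h = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def in_box(lat, lon, center_lat, center_lon, km):
    """Cheap square prefilter: two range checks, no trig per document."""
    dlat = km / 111.0                                # ~111 km per degree of latitude
    dlon = km / (111.0 * cos(radians(center_lat)))   # widens toward the poles
    return abs(lat - center_lat) <= dlat and abs(lon - center_lon) <= dlon

# Paris vs a 100km radius around London: rejected by the box alone,
# so the exact distance never needs to be computed for that document.
london = (51.50853, -0.12574)
print(in_box(48.8566, 2.3522, *london, 100))   # False
print(haversine_km(*london, 48.8566, 2.3522) > 300)  # True
```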

clint

On Jul 18, 6:28 am, Clinton Gormley clin...@iannounce.co.uk wrote:

Hi Michael

As a user types information about a member I need to provide a list of
suggestions based on the ...: member first/last name, his member id
and date of birth.

It is my understanding that I need to index my membership info (~100M
records) so that I can use edit distance and metaphone against first
name/ last name. I also would like to use a list of synonyms for the
first name ('Bob' vs 'Robert').

Yes - preparing your data correctly is essential.

It also makes sense to use edit
distance against member id - to account for typing errors.

Do you really think this is the case? If somebody is typing a member ID,
then I think they should see JUST the associated user. Otherwise, if
all I have is the member ID, I type that, and it shows me 20 different
users, how do I know which is the one I want? I'll ignore this
requirement.

Date of birth I would like to use as a filter - if it is there I only
show members matching it.
Also to further narrow the list I would only include members who live
within certain distance of the service location.

I apologise in advance - this email is long, but is well worth reading
(and I should probably turn it into a tutorial, as this question is
asked often):

OK, so there are two phases here:

  1. preparing your data, and
  2. searching

PREPARING YOUR DATA:

First, let's decide how each field needs to be indexed, then we can look
at what analyzers we need to provide.

So we have a list of custom analyzers we need to define:

  • full_name:

    • standard token filter
    • lowercase
    • ascii folding
  • partial_name:

    • standard token filter
    • lowercase
    • ascii folding
    • synonyms
    • edge ngrams
  • name_metaphone:

    • standard token filter
    • phonetic/metaphone filter

Here is the command to create the index with the above analyzers and
mapping:https://gist.github.com/1088986

It is quite long, but just goes through the process listed above. If you
look at each block, it's actually quite simple.

NOTES:

  1. For first_name/last_name, I am using multi-fields. The "main"
    sub-field has the same name as the top level, so that if I refer
    to "first_name" it automatically references "first_name.first_name"

    So in effect, I have "first_name" and "first_name.partial" and
    "first_name.metaphone"

  2. In the partial name fields, I am using index_analyzer and
    search_analyzer.

    Normally, you want your data and search terms to use the same
    analyzer - this ensures that you are searching for the same
    terms that are actually stored in ES. For example, in the
    first_name.metaphone field, I just specify an 'analyzer'
    (which sets both the search_analyzer and index_analyzer to
    the same value)

    However, for the partial field, we want them to be different. If we
    store the name "Clinton", we want to be able to use auto-complete
    for search terms like 'clin' (ie partial matches). So at index time,
    we tokenize clinton as c,cl,cli,clin,clint,clinto,clinton

    However, when we search, we don't want 'clin' to match 'cat','cliff'
    etc. So we DON'T want to use the ngram tokenizer on search terms.

So, run the commands in the gist above, and then you can experiment with
searches.

You can see what tokens each analyzer produces with the analyze API.
Try these queries:

curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=rob&analyzer=full_n...
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=rob&analyzer=partia...
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=rob&analyzer=name_m...

and to check that the ascii folding is working, try 'sánchez' (but URL encoded):

curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=sánchez&analyz...

SEARCHING YOUR DATA

Let's get rid of the easy stuff first:

birthday:

    If your user enters a birthday, then you want to filter the
    results to only include members with a matching birthday:

       { term: { birthday: '1970-10-24' }}

location:

    Use a geo_distance filter to find results within 100km of
    London:

       { geo_distance: {
               distance: "100km",
               location: [51.50853, -0.12574]
       }}

member_id:

 This can use a simple text query:

 { text: { member_id: "abc-1234" }}

OK - now the more interesting stuff: first name and last name.

The logic we want to use here is:

Show me any name whose first or last name field matches
completely or partially, but consider full word matches
to be more relevant than partial matches or metaphone
matches

We're going to combine these queries using the 'bool' query. The
difference between the 'bool' query and the 'dismax' query is that the
'bool' query combines the _score/relevance of each matching clause,
while the 'dismax' query chooses the highest _score from the matching
clauses.

{ bool:
{ should: [
{ text: { "first_name": "rob" }}, # full name
{ text: { "first_name.partial": "rob" }} # partial match
{ text: { "first_name.metaphone": "rob"}} # metaphone
]

}}

This will find all docs that match any of the above clauses.

The _score of each matching clause is combined, so a doc which matches
all 3 clauses will rank higher than a doc that matches just one clause,
so we already have some ranking here.

But lets say that we wanted a full word match to be significantly more
relevant than the other two. We can change that clause to:

{ text: { "first_name": {
query: "rob",
boost: 1
}}}

Of course, we need to include the same 3 clauses for "last_name" as
well.

Now, to turn these into a query that we can pass to the Search API:

All search queries must be wrapped in a top-level { query: {....}}
element, which will contain one of 3 possibilities:

  1. just the bool query
    { query: { bool: {...}}

    https://gist.github.com/1089180

  2. just one or more filters
    { query: { constant_score: { filter: {....} }}

    https://gist.github.com/1089206

  3. the bool query combined with one or more filters
    { query: { filtered: { query: {bool: ...}, filter: {.....} }}}

    https://gist.github.com/1089201

This was long, but I hope it was worth it.

If anything isn't clear, please ask, and I can improve this and turn it
into a tutorial.

clint

--
Web Announcements Limited is a company registered in England and Wales,
with company number 05608868, with registered address at 10 Arvon Road,
London, N5 1PR.


(Shay Banon) #5

I suggest you start with mapping. Start with a simple one, where you have
mappings set for different elements (first name, last name), with custom
analyzers that you define that use what you want. You might need to use
multi_field mapping type if you want to have several analyzers applied to
the same field.



(lalit mishra) #6

Hi Clinton,
In place of ngrams, can I use a prefix query to serve the purpose? Is
there any advantage to using the ngram tokenizer?

Below configuration is an example:

{
    "tweet" : {
        "properties" : {
            "shortName" : {
                "type" : "multi_field",
                "fields" : {
                    "name" : {"type" : "string", "index" : "analyzed"},
                    "untouched" : {"type" : "string", "index" : "not_analyzed"}
                }
            }
        }
    }
}

query name.untouched for exact search using textPhrase,
and a prefix query for partial search.

Please let me know if you think otherwise.

Thanks,
Lalit.

On Mon, Jul 18, 2011 at 4:58 PM, Clinton Gormley clinton@iannounce.co.ukwrote:

Hi Michael

As a user types information about a member I need to provide a list of
suggestions based on the ...: member first/last name, his member id
and date of birth.

It is my understanding that I need to index my membership info (~100M
records) so that I can use edit distance and metaphone against first
name/ last name. I also would like to use a list of synonyms for the
first name ('Bob' vs 'Robert').

Yes - preparing your data correctly is essential.

It also makes sense to use edit
distance against member id - to account for typing errors.

Do you really think this is the case? If somebody is typing a member ID,
then I think they should see JUST the associated user. Otherwise, if
all I have is the member ID, I type that, and it shows me 20 different
users, how do I know which is the one I want? I'll ignore this
requirement.

Date of birth I would like to use as a filter - if it is there I only
show members matching it.

Also to further narrow the list I would only include members who live
within certain distance of the service location.

I apologise in advance - this email is long, but is well worth reading
(and I should probably turn it into a tutorial, as this question is
asked often):

OK, so there are two phases here:

  1. preparing your data, and
  2. searching

PREPARING YOUR DATA:

First, let's decide how each field needs to be indexed, then we can look
at what analyzers we need to provide.

  • first name / last name:

  • these are string fields

  • we want to use synonyms (eg Robert vs Bob)

http://www.elasticsearch.org/guide/reference/index-modules/analysis/synonym-tokenfilter.html

  • we want to include metaphones

http://www.elasticsearch.org/guide/reference/index-modules/analysis/phonetic-tokenfilter.html

http://www.elasticsearch.org/guide/reference/index-modules/analysis/asciifolding-tokenfilter.html

  • we want to do 3 types of matches:

    • most relevant: full word matches
    • less relevant: partial word matches (eg with ngrams, synonyms)
    • least relevant: metaphone matches

    so we'll index the names with three versions, as a multi-field

http://www.elasticsearch.org/guide/reference/mapping/multi-field-type.html

  • member id
    • you didn't specify if this is an numeric or alphanumeric, so I'll
      just assume alphanumeric, possibly with punctuation, eg
      "ABC-1234"

    • let's say that we want to tokenize this as "abc","1234", so we'll
      use the "simple" analyzer

http://www.elasticsearch.org/guide/reference/index-modules/analysis/simple-analyzer.html

  • birthday

    • this is just a date field, no analysis needed
  • location

    • this is a geo_point field, no analysis needed

So we have a list of custom analyzers we need to define:

  • full_name:

    • standard token filter
    • lowercase
    • ascii folding
  • partial_name:

    • standard token filter
    • lowercase
    • ascii folding
    • synonyms
    • edge ngrams
  • name_metaphone:

    • standard token filter
    • phonetic/metaphone filter

Here is the command to create the index with the above analyzers and
mapping: https://gist.github.com/1088986

It is quite long, but just goes through the process listed above. If you
look at each block, it's actually quite simple.

NOTES:

  1. For first_name/last_name, I am using multi-fields. The "main"
    sub-field has the same name as the top level, so that if I refer
    to "first_name" it automatically references "first_name.first_name"

So in effect, I have "first_name" and "first_name.partial" and
"first_name.metaphone"

  1. In the partial name fields, I am using index_analyzer and
    search_analyzer.

Normally, you want your data and search terms to use the same
analyzer - this ensures that you are searching for the same
terms that are actually stored in ES. For example, in the
first_name.metaphone field, I just specify an 'analyzer'
(which sets both the search_analyzer and index_analyzer to
the same value)

However, for the partial field, we want them to be different. If we
store the name "Clinton", we want to be able to use auto-complete
for search terms like 'clin' (ie partial matches). So at index time,
we tokenize clinton as c,cl,cli,clin,clint,clinto,clinton

However, when we search, we don't want 'clin' to match 'cat','cliff'
etc. So we DON'T want to use the ngram tokenizer on search terms.

So, run the commands in the gist above, and then you can experiment with
searches.

You can see what tokens each analyzer produces with the analyze API.
Try these queries:

curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=rob&analyzer=full_name'
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=rob&analyzer=partial_name'
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=rob&analyzer=name_metaphone'

and to check that the ascii folding is working, try 'sánchez' (but URL
encoded):

curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=sánchez&analyzer=partial_name'
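If you don't have a node handy, the folding step itself can be checked offline. Python's unicodedata is a reasonable stand-in for Lucene's ASCII folding filter (decompose accented characters, then drop the combining marks):

```python
import unicodedata

def ascii_fold(text):
    # NFD splits 'á' into 'a' + combining accent; drop the marks ('Mn')
    return "".join(c for c in unicodedata.normalize("NFD", text)
                   if unicodedata.category(c) != "Mn")

print(ascii_fold("sánchez"))  # sanchez
```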

SEARCHING YOUR DATA

Let's get rid of the easy stuff first:

birthday:

   If your user enters a birthday, then you want to filter the
   results to only include members with a matching birthday:

      { term: { birthday: '1970-10-24' }}

location:

   Use a geo_distance filter to find results within 100km of
   London:

      { geo_distance: {
              distance: "100km",
              location: [-0.12574, 51.50853]
      }}

   (note: when location is given as an array, it is in [lon, lat] order)
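Under the hood this is a great-circle distance test. A sketch of the arithmetic (the Oxford and Paris coordinates are my own examples, not from the thread):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two points, Earth radius ~6371 km
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))

# Oxford (~80 km from London) passes a 100 km filter; Paris does not.
print(haversine_km(51.50853, -0.12574, 51.752, -1.2577) < 100)   # True
print(haversine_km(51.50853, -0.12574, 48.8566, 2.3522) < 100)   # False
```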

member_id:

This can use a simple text query:

{ text: { member_id: "abc-1234" }}

OK - now the more interesting stuff: first name and last name.

The logic we want to use here is:

Show me any name whose first or last name field matches
completely or partially, but consider full word matches
to be more relevant than partial matches or metaphone
matches

We're going to combine these queries using the 'bool' query. The
difference between the 'bool' query and the 'dismax' query is that the
'bool' query combines the _score/relevance of each matching clause,
while the 'dismax' query chooses the highest _score from the matching
clauses.

{ bool:
  { should: [
      { text: { "first_name": "rob" }},           # full name
      { text: { "first_name.partial": "rob" }},   # partial match
      { text: { "first_name.metaphone": "rob" }}  # metaphone
  ]
}}

This will find all docs that match any of the above clauses.

The _score of each matching clause is combined, so a doc which matches
all 3 clauses will rank higher than a doc that matches just one clause,
so we already have some ranking here.
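With made-up per-clause scores for a single document, the combining step looks like this (the numbers are invented; real Lucene scoring also applies normalization and, in this era, a coordination factor):

```python
# Invented per-clause scores for one matching document
clause_scores = {"full_word": 3, "partial": 1, "metaphone": 1}

bool_score = sum(clause_scores.values())    # 'bool' combines the scores
dismax_score = max(clause_scores.values())  # 'dis_max' takes the best clause

print(bool_score, dismax_score)  # 5 3
```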

But let's say that we want a full word match to be significantly more
relevant than the other two. We can change that clause to:

{ text: { "first_name": {
      query: "rob",
      boost: 2
}}}

Of course, we need to include the same 3 clauses for "last_name" as
well.

Now, to turn these into a query that we can pass to the Search API:

All search queries must be wrapped in a top-level { query: {....}}
element, which will contain one of 3 possibilities:

  1. just the bool query
    { query: { bool: {...}}}

https://gist.github.com/1089180

  2. just one or more filters
    { query: { constant_score: { filter: {....} }}}

https://gist.github.com/1089206

  3. the bool query combined with one or more filters
    { query: { filtered: { query: {bool: ...}, filter: {.....} }}}

https://gist.github.com/1089201

This was long, but I hope it was worth it.

If anything isn't clear, please ask, and I can improve this and turn it
into a tutorial.

clint


(Clinton Gormley) #7

Hi Lalit

In place of ngrams, can I use a prefix query to serve the purpose? Is
there any advantage to using the ngram tokenizer?

Performance. The prefix query is easy to use, but nowhere near as
efficient: first it needs to find all the terms which might match, then
run queries on all of those. And you may end up with too many matching
terms.

So the prefix query is fine for small numbers of terms, but ngrams will
scale.

clint


(lalit mishra) #8

Thanks Clinton for quick response.

My knowledge of edge ngrams is limited. Can you please shed some light
on what an edge ngram actually is? I would also like to use edge
ngrams.

Thanks,
Lalit.



(Clinton Gormley) #9

My knowledge of edge ngrams is limited. Can you please shed some light
on what an edge ngram actually is? I would also like to use edge
ngrams.

An ngram is a moving window, so an ngram of length 2 of the word "help"
would give you "he","el","lp"

An edge-ngram is anchored to either the beginning or the end of the
word, eg "h","he","hel","help" or (from the end) "help","elp","lp","p"
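In code, the two tokenizations of "help" look like this (a plain Python sketch, not the Lucene tokenizer itself):

```python
def ngrams(word, n):
    # Sliding window of length n: 'help' -> ['he', 'el', 'lp']
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def edge_ngrams(word):
    # Anchored to the front of the word: 'help' -> ['h', 'he', 'hel', 'help']
    return [word[:i] for i in range(1, len(word) + 1)]

print(ngrams("help", 2))    # ['he', 'el', 'lp']
print(edge_ngrams("help"))  # ['h', 'he', 'hel', 'help']
```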

clint


--
Web Announcements Limited is a company registered in England and Wales,
with company number 05608868, with registered address at 10 Arvon Road,
London, N5 1PR.


(lalit mishra) #10

Cool, thanks :slight_smile:



(system) #11