Hyphenation and superfluous results with ngram analyser for autocomplete

I am trying to configure elasticsearch for autocomplete and have been
quite successful in doing so; however, there are a couple of behaviours I
would like to tweak if possible.

  1. When searching for 'Mercedes-Benz' no results are returned with the
    current setup, even though one of the indexed items contains the term.
    'mercedes benz', 'merc' and 'benz' all match the right item as expected.

  2. When searching for 'Mercedes-Be' I get a superfluous result: "Being
    Cool With Bond, James Bond". The term is obviously being broken into
    'mercedes' and 'be', the latter matching the start of "Being"; however, I
    would rather the second word acted to further limit the results presented
    to the user (as is probably expected).

The results, settings and mapping are listed in the following gist:
https://gist.github.com/4537084

Could anyone offer any guidance on how to fix these issues?

Cheers,

Jon

--

During indexing you are using the standard tokenizer, which splits words on
"-". So 'Mercedes-Benz' is indexed like this:

$ curl -s "localhost:9200/courses/_analyze?analyzer=autocomplete_analyzer&pretty=true" -d "Mercedes-Benz" | grep "token"
"token" : "me",
"token" : "mer",
"token" : "merc",
"token" : "merce",
"token" : "merced",
"token" : "mercede",
"token" : "mercedes",
"token" : "be",
"token" : "ben",
"token" : "benz",
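
(The autocomplete_analyzer in your gist is presumably something along
these lines — edge ngrams on top of the standard tokenizer. This is just a
sketch; the filter name and the min_gram/max_gram values are guesses from
the token output above:)

```json
{
  "analysis": {
    "analyzer": {
      "autocomplete_analyzer": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["lowercase", "autocomplete_ngram"]
      }
    },
    "filter": {
      "autocomplete_ngram": {
        "type": "edgeNGram",
        "min_gram": 2,
        "max_gram": 15
      }
    }
  }
}
```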

For the search part you are using the keyword tokenizer, which doesn't
tokenize at all, so the query for "mercedes-benz" is translated into a
query for the single term "mercedes-benz":

$ curl -s -X GET 'http://localhost:9200/courses/course/_validate/query?pretty=true&explain=true' -d '{
  "query_string": {
    "query": "mercedes-benz",
    "fields": ["name"]
  }
}' | grep "explanation"
"explanation" : "name:mercedes-benz"

There is no token "mercedes-benz" in the index, so you get no results. When
you replace "-" with a space, the query_string parser splits the query into
two parts:

$ curl -s -X GET 'http://localhost:9200/courses/course/_validate/query?pretty=true&explain=true' -d '{
  "query_string": {
    "query": "mercedes benz",
    "fields": ["name"]
  }
}' | grep "explanation"
"explanation" : "name:mercedes name:benz"

It searches for the term mercedes OR the term benz, and you get the
expected result. But because of this "OR" you also find "Being Cool..."
when you change your query to "mercedes be".

To fix it, you should first replace the query_string query with something
that won't interfere with your tokenization. The match query
(http://www.elasticsearch.org/guide/reference/query-dsl/match-query.html)
might be a good candidate. That still leaves the mismatch between the
search tokenizer and the index tokenizer to be addressed. There are a few
options here. The simplest one is to replace the search analyzer with the
standard analyzer with no stop words. That will work in most cases. The
only potential issue is that it will disregard word order in your search,
so it will also find "Being Cool" when you search for "Cool Being". Not
sure if this is something you want to avoid or not.
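
Something like this might work for the settings/mapping (a sketch only —
I'm calling the search analyzer autocomplete_search here, and assuming a
plain lowercase filter; autocomplete_analyzer and the name field are as in
your gist):

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete_search": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "course": {
      "properties": {
        "name": {
          "type": "string",
          "index_analyzer": "autocomplete_analyzer",
          "search_analyzer": "autocomplete_search"
        }
      }
    }
  }
}
```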

On Tuesday, January 15, 2013 3:09:58 AM UTC-5, Jonathan Evans wrote:

--

Cheers Igor, this works great! I have a much better understanding after
your detailed breakdown as well.

One further undesirable thing I have noticed is that if I type 'Mercedes
Benz Drama' it returns an item "primary drama". I cannot think of any way
to fix this, as it's the same sort of problem as the ordering problem you
mentioned.

Any ideas?

On Tuesday, January 15, 2013 4:40:39 PM UTC, Igor Motov wrote:

--

You could add an additional query like text_phrase that boosts phrase
matches. Alternatively, you could build phrase matches into your index
with something like shingles.
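
For instance, combined with the match query it could look something like
this (just a sketch — the boost value is arbitrary, and on newer releases
text_phrase is spelled match_phrase):

```json
{
  "bool": {
    "must": {
      "match": {
        "name": { "query": "mercedes benz", "operator": "and" }
      }
    },
    "should": {
      "text_phrase": {
        "name": { "query": "mercedes benz", "boost": 2.0 }
      }
    }
  }
}
```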

-Zach

On Wednesday, January 16, 2013 5:48:30 AM UTC-5, Jonathan Evans wrote:

--

It happens because the match query uses the OR operator by default as
well. But that's easy to fix. Just change the operator to "and" in your
match query like this:

"match": {
  "name": {
    "query": "Mercedes-Benz",
    "operator": "and"
  }
}
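
You can sanity-check it with the same _validate/query trick as before.
With "and", every term in the query becomes required, so "primary drama"
will no longer match "mercedes benz drama" — it contains none of the
mercedes/benz tokens. The body to validate would be (a sketch, same
courses index and name field as above):

```json
{
  "match": {
    "name": {
      "query": "mercedes benz drama",
      "operator": "and"
    }
  }
}
```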

On Wednesday, January 16, 2013 5:48:30 AM UTC-5, Jonathan Evans wrote:

--

Awesome response!

--