Hyphenation and superfluous results with ngram analyser for autocomplete

I am trying to configure elasticsearch for autocomplete and have been
quite successful in doing so; however, there are a couple of behaviours I
would like to tweak if possible.

  1. When searching for 'Mercedes-Benz' no results are returned with the
    current setup, even though one of the indexed items contains the term.
    'mercedes benz', 'merc' and 'benz' all match the right item as expected.

  2. When searching for 'Mercedes-Be' I get a superfluous result: "Being
    Cool With Bond, James Bond". The term is obviously being broken into
    'mercedes' and 'be', the latter matching the start of "Being"; however, I
    would rather the second word acted to further limit the results presented
    to the user (as is probably expected).

The results, settings and mapping are listed in the following gist:
https://gist.github.com/4537084

Could anyone offer any guidance on how to fix these issues?

Cheers,

Jon

--

During indexing you are using the standard tokenizer, which splits words on
"-". So 'Mercedes-Benz' is indexed like this:

$ curl -s "localhost:9200/courses/_analyze?analyzer=autocomplete_analyzer&pretty=true" -d "Mercedes-Benz" | grep "token"
"token" : "me",
"token" : "mer",
"token" : "merc",
"token" : "merce",
"token" : "merced",
"token" : "mercede",
"token" : "mercedes",
"token" : "be",
"token" : "ben",
"token" : "benz",
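
(The autocomplete_analyzer in your gist is presumably something along
these lines — edge ngrams on top of the standard tokenizer. This is just a
sketch; the filter name and the min_gram/max_gram values are guesses from
the token output above:)

```json
{
  "analysis": {
    "analyzer": {
      "autocomplete_analyzer": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["lowercase", "autocomplete_ngram"]
      }
    },
    "filter": {
      "autocomplete_ngram": {
        "type": "edgeNGram",
        "min_gram": 2,
        "max_gram": 15
      }
    }
  }
}
```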

For the search part you are using the keyword tokenizer, which doesn't
tokenize at all, so the query for "mercedes-benz" is translated into a
query for the single term "mercedes-benz":

$ curl -s -X GET 'http://localhost:9200/courses/course/_validate/query?pretty=true&explain=true' -d '{
  "query_string": {
    "query": "mercedes-benz",
    "fields": ["name"]
  }
}' | grep "explanation"
"explanation" : "name:mercedes-benz"

There is no token "mercedes-benz" in the index, so you get no results. When
you replace "-" with a space, the query_string parser splits the query into
two parts:

$ curl -s -X GET 'http://localhost:9200/courses/course/_validate/query?pretty=true&explain=true' -d '{
  "query_string": {
    "query": "mercedes benz",
    "fields": ["name"]
  }
}' | grep "explanation"
"explanation" : "name:mercedes name:benz"

It searches for the term mercedes OR the term benz, and you get the
expected result. But because of this "OR" you also find "Being Cool..."
when you change your query to "mercedes be".

To fix it, you should first replace the query_string query with something
that won't interfere with your tokenization. The match query
(http://www.elasticsearch.org/guide/reference/query-dsl/match-query.html)
might be a good candidate. That still leaves the mismatch between the
search tokenizer and the index tokenizer to be addressed. There are a few
options here. The simplest one is to replace the search analyzer with the
standard analyzer with no stop words. That will work in most cases. The
only potential issue is that it will disregard word order in your search,
so it will also find "Being Cool" when you search for "Cool Being". Not
sure if this is something you want to avoid or not.
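
Something like this might work for the settings/mapping (a sketch only —
I'm calling the search analyzer autocomplete_search here, and assuming a
plain lowercase filter; autocomplete_analyzer and the name field are as in
your gist):

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete_search": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "course": {
      "properties": {
        "name": {
          "type": "string",
          "index_analyzer": "autocomplete_analyzer",
          "search_analyzer": "autocomplete_search"
        }
      }
    }
  }
}
```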

On Tuesday, January 15, 2013 3:09:58 AM UTC-5, Jonathan Evans wrote:

--

Cheers Igor, this works great! I have a much better understanding after
your detailed breakdown as well.

One further undesirable thing I have noticed is that if I type 'Mercedes
Benz Drama' it returns an item "primary drama". I cannot think of any way
to fix this, as it's the same sort of problem as the ordering problem you
mentioned.

Any ideas?

On Tuesday, January 15, 2013 4:40:39 PM UTC, Igor Motov wrote:

--

You could add an additional query like text_phrase that boosts phrase
matches. Alternatively, you could build phrase matches into your index
with something like shingles.
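
For instance, combined with the match query it could look something like
this (just a sketch — the boost value is arbitrary, and on newer releases
text_phrase is spelled match_phrase):

```json
{
  "bool": {
    "must": {
      "match": {
        "name": { "query": "mercedes benz", "operator": "and" }
      }
    },
    "should": {
      "text_phrase": {
        "name": { "query": "mercedes benz", "boost": 2.0 }
      }
    }
  }
}
```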

-Zach

On Wednesday, January 16, 2013 5:48:30 AM UTC-5, Jonathan Evans wrote:

--

It happens because the match query uses the OR operator by default as
well. But that's easy to fix. Just change the operator to "and" in your
match query like this:

"match": {
  "name": {
    "query": "Mercedes-Benz",
    "operator": "and"
  }
}
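
You can sanity-check it with the same _validate/query trick as before.
With "and", every term in the query becomes required, so "primary drama"
will no longer match "mercedes benz drama" — it contains none of the
mercedes/benz tokens. The body to validate would be (a sketch, same
courses index and name field as above):

```json
{
  "match": {
    "name": {
      "query": "mercedes benz drama",
      "operator": "and"
    }
  }
}
```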

On Wednesday, January 16, 2013 5:48:30 AM UTC-5, Jonathan Evans wrote:

--

Awesome response!

--