Issues with elasticsearch + autocomplete

Hello everyone,

I have added an autocomplete feature to my search input box and am
using an ngram filter for it.
My index_analyzer is ngram and search_analyzer is standard. Along with
this, I also use highlight, so that the matches are highlighted in a
different color in the autocomplete.

I have this problem. I have these 2 sentences I'm searching against - "Play
baseball", "Play tennis". When I type "P", "Pl", "Pla", the 2 sentences get
matched. But when I type "Play", I get 0 matches. If I continue typing to
"Play B", then onwards it rightly matches "Play baseball".

Why is this so?

Another problem: I have a word I'm searching against, "Trekking". When I
type 'trekking', it rightly matches the word, but only 'trek' is
highlighted and the rest of the word is not.
E.g. when I type 'trek', 'trekk', 'trekki', 'trekkin', the word is rightly
matched and the letters I have typed are highlighted. When I type
'trekking', only 'trek' is highlighted. Why is this? These issues are
driving me up the wall :(

Regards

--

Hi Niranjan

It's much better to gist a curl recreation of the problem so that we can
see exactly what you're doing, and what you could change. With the info
we have so far, we'd just be guessing.

See the Elasticsearch help page for advice about how to gist a
recreation.

clint

--

Hi Clinton,

Thanks for the tip.
My gist recreation is here: gist:3372990 · GitHub
My mapping is here: gist:3373031 · GitHub

Regards,
Niranjan

--

On Fri, 2012-08-17 at 01:20 +0530, Niranjan U wrote:

Hi Clinton,

Thanks for the tip.
My gist recreation is here: gist:3372990 · GitHub

Ummm... You've gisted your cluster health...

When I say "gist a curl recreation" I mean something that we can copy
and paste and so recreate your problem

clint

--

Hi Clint.

The problem is that there is an example on the help page: curl -XGET 'http://127.0.0.1:9200/_cluster/health?pretty=true'

Some users think that they need to give the cluster health.

We should enhance this part of the help page and tell users exactly what you wrote: "something that we can copy and paste and so recreate your problem"

I think we should add a link to a Gist example with full script:
Delete index
Create index
Index some docs
Refresh index
Query
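
A minimal sketch of what such a script could contain, written as a small Python helper that prints the curl commands in order. The index name "test", the type "interests", the one-shard setting and the sample documents are assumptions for illustration only; the host and port are the ones used in the help page example.

```python
import json

BASE = "http://127.0.0.1:9200"
INDEX = "test"  # hypothetical index name, used only for the recreation


def curl(method, path, body=None):
    """Render one copy-and-paste curl command, with an optional JSON body."""
    cmd = f"curl -X{method} '{BASE}{path}'"
    if body is not None:
        cmd += f" -d '{json.dumps(body)}'"
    return cmd


steps = [
    # 1. Delete the index so the recreation starts from a clean state.
    curl("DELETE", f"/{INDEX}"),
    # 2. Create the index (your real settings and mapping would go here).
    curl("PUT", f"/{INDEX}", {"settings": {"number_of_shards": 1}}),
    # 3. Index some documents (assumed sample data).
    curl("PUT", f"/{INDEX}/interests/1", {"name": "Play baseball"}),
    curl("PUT", f"/{INDEX}/interests/2", {"name": "Play tennis"}),
    # 4. Refresh so the documents are visible to search.
    curl("POST", f"/{INDEX}/_refresh"),
    # 5. Run the query that shows the problem.
    curl("GET", f"/{INDEX}/interests/_search?pretty=1",
         {"query": {"text": {"name": "Play"}}}),
]

print("\n".join(steps))
```

The printed commands can be pasted into a gist as they are, so that anyone can run them against a local node and see the same result.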

What do you think? I can work on that.

David

--

Hi David

Problem is that there is an example in help page: curl -XGET
'http://127.0.0.1:9200/_cluster/health?pretty=true'

Some users think that they need to give the cluster health.

heh ok

I wrote that page a long time ago, and haven't looked at it since. I've
just assumed that I expressed myself clearly the first time. Clearly
not :)

We should enhance this part of the help page and tell users exactly
what you wrote: "something that we can copy and paste and so recreate
your problem"

Indeed!

I think we should add a link to a Gist example with full script:
Delete index
Create index
Index some docs
Refresh index
Query

What do you think? I can work on that.

That'd be great. thanks

clint

--

Hi Clinton,

Sorry about that. I am sending gist of some of the curl operations. Hope
this will shed more light on the issue I am facing.

  1. Firstly, here is the gist of the mapping I am using:
    gist:3373031 · GitHub
    As you can see, my index type "interests" has ngram as my
    index_analyzer and "standard" as search_analyzer. I am also adding porter
    stem filter to "my_ngram" analyzer to be able to search over stem words

  2. Here are some gist recreations of the issues I am facing:

    (a) When I search for "trek": and enable highlighting, this is the
    result: gist:3374311 · GitHub
    "trekking" gets highlighted, when in fact only "trek" should be
    highlighted, that being my input. Thus, my autocomplete
    input box has mismatched highlighting
    (b) When I search for "bas", this is the result:
    gist:3374333 · GitHub
    Check the highlight result, "basketba" is getting highlighted.
    Again, my autocomplete is getting mismatched highlighting

    (c) Now this one is really weird. I am typing the word entrepreneurs
    in my text box.
    When I type "entrep", I rightly get the result with the
    highlight working as expected: gist:3374351 · GitHub

    However, when I type "entreprene", it still shows the result but
    highlight doesn't work on the text: gist:3374359 · GitHub

    Now, weird part is, when I type the full word "entrepreneurs",
    I get 0 results!! gist:3374366 · GitHub

I am totally stumped by the kind of results I am getting, any suggestions
please?

Regards,
Niranjan

--

Hi Niranjan

Sorry about that. I am sending gist of some of the curl operations.
Hope this will shed more light on the issue I am facing.

Much better!

  1. Firstly, here is the gist of the mapping I am using:
    gist:3373031 · GitHub
    As you can see, my index type "interests" has ngram as my
    index_analyzer and "standard" as search_analyzer. I am also adding
    porter stem filter to "my_ngram" analyzer to be able to search over
    stem words

OK - there are several things wrong with this mapping:

  1. you are using a blunderbuss approach - trying to make EVERYTHING
    autocomplete and stemmed and so on. Rather enable these things
    selectively, where you really need them

  2. Rather use edge-ngrams instead of ngrams. People expect
    'tre' to match 'trekking', not 'contretemps'.

  3. There is no need to combine porter-stem with ngrams.
    Porter-stem may convert (eg) 'camping' to 'camp'.
    Edge-ngrams would give you:
    c, ca, cam, camp, campi, campin, camping
    Combined with stemming, you'd just get:
    c, ca, cam, camp
    I think that's where your entrepreneurs search is going wrong

  4. Use the same analyzer at search and index time, otherwise there
    is a good chance that you'll be searching for stuff which just
    isn't there! Also, different analyzers order tokens differently,
    which can affect results.
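
Points 2 and 3 can be illustrated with a rough Python model of the two token filters. This is only a sketch: the function names are made up for illustration, min_gram=1 and max_gram=10 are assumed settings, and the stemmer output 'camp' is taken from the example above rather than computed.

```python
def ngrams(word, min_gram=1, max_gram=10):
    """All substrings of length min_gram..max_gram (roughly what an ngram filter emits)."""
    return {
        word[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(word) - n + 1)
    }


def edge_ngrams(word, min_gram=1, max_gram=10):
    """Only the prefixes of length min_gram..max_gram (roughly an edge-ngram filter)."""
    return [word[:n] for n in range(min_gram, min(len(word), max_gram) + 1)]


# Point 2: plain ngrams match in the middle of a word, edge-ngrams do not.
print("tre" in ngrams("contretemps"))       # True
print("tre" in edge_ngrams("contretemps"))  # False
print("tre" in edge_ngrams("trekking"))     # True

# Point 3: if the stemmer runs first, the longer prefixes are never indexed.
stemmed = "camp"  # assumed stemmer output for "camping", as in the example above
print(edge_ngrams("camping"))           # ['c', 'ca', 'cam', 'camp', 'campi', 'campin', 'camping']
print(edge_ngrams(stemmed))             # ['c', 'ca', 'cam', 'camp']
print("campi" in edge_ngrams(stemmed))  # False
```

A search for 'campi' that is not stemmed the same way would then find nothing, which is the kind of mismatch described in points 3 and 4.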

A better mapping is gist:992e0e704e035e8a1770 · GitHub

Note: I'm using a multi-field for "name" with two versions: one analyzed
with the 'english' analyzer, and one with edge_ngrams.

The basic query looks like this:

curl -XGET 'http://127.0.0.1:9200/test/interests/_search?pretty=1' -d '
{
   "query" : {
      "bool" : {
         "should" : [
            {
               "text" : {
                  "name.ngrams" : {
                     "operator" : "and",
                     "query" : "SEARCH TERMS"
                  }
               }
            },
            {
               "text" : {
                  "name" : "SEARCH TERMS"
               }
            }
         ]
      }
   },
   "highlight" : {
      "fields" : {
         "name.ngrams" : {},
         "name" : {}
      }
   }
}
'

I'm querying the 'name.ngrams' field (which will match partial words),
and I'm also querying the 'name' (or 'name.name') field, which will match
any full words and increase their relevance.

Note: I'm using the 'and' operator for the ngrams, otherwise 'tre' will
match anything that contains just 't'.
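
To see why the operator matters, here is a rough Python simulation. It assumes the query string goes through the same edge-ngram analysis as the documents, so 'tre' becomes the tokens t, tr and tre; the helper names and the three sample documents are assumptions for illustration, and real scoring is ignored.

```python
def edge_ngrams(word, min_gram=1, max_gram=10):
    """Prefixes of a word, as a simple stand-in for an edge-ngram filter."""
    return [word[:n] for n in range(min_gram, min(len(word), max_gram) + 1)]


def analyze(text):
    """Lowercase, split on whitespace and emit the edge-ngrams of every word."""
    tokens = set()
    for word in text.lower().split():
        tokens.update(edge_ngrams(word))
    return tokens


def matches(query, doc, operator):
    """'or' needs any query token in the document, 'and' needs all of them."""
    query_tokens = analyze(query)
    doc_tokens = analyze(doc)
    if operator == "and":
        return query_tokens <= doc_tokens
    return bool(query_tokens & doc_tokens)


docs = ["trekking", "play tennis", "play basketball"]
print([d for d in docs if matches("tre", d, "or")])   # ['trekking', 'play tennis']
print([d for d in docs if matches("tre", d, "and")])  # ['trekking']
```

With 'or', "play tennis" matches only because it shares the single-letter token 't' with the query.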

I'm highlighting both fields.
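
If the request is sent from application code rather than curl, the same body can be built as a plain dictionary. This is a sketch only: the function name is hypothetical, and the field names and the 'text' query are copied from the request above.

```python
import json


def build_autocomplete_query(search_terms):
    """Return the request body shown above for the given user input."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Partial-word match on the edge-ngram sub-field.
                    {"text": {"name.ngrams": {"operator": "and",
                                              "query": search_terms}}},
                    # Full-word match on the main field, to boost relevance.
                    {"text": {"name": search_terms}},
                ]
            }
        },
        # Highlight both fields so the UI can choose which snippet to show.
        "highlight": {"fields": {"name.ngrams": {}, "name": {}}},
    }


body = build_autocomplete_query("trek")
print(json.dumps(body, indent=2))
```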

Here are some results:

For 'tre':
"name.ngrams" : "trekking"

For 'trek':

   "name.ngrams" : "<em>trek</em>king"
   "name"        : "<em>trekking</em>"

# note 'trekking' has been highlighted because the stemmed term
# is 'trek'

For 'bas':
"name.ngrams" : "play basketball"

For 'basketbal':

   "name.ngrams" : "play <em>basketbal</em>l"
   "name"        : "play <em>basketball</em>"

For 'entrepreneurs':

   "name.ngrams" : "meet <em>entreprene</em>urs"
   "name"        : "meet <em>entrepreneurs</em>"

   # note how the ngram match stops at 10 letters - that's because
   # your max_gram was set to 10. 
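
The 10-letter cutoff can be checked with a small Python model of the edge-ngram filter. This is a sketch under assumptions: min_gram=1 is assumed, the function name is made up, and it only shows which prefixes get indexed, not how the real highlighter works.

```python
def edge_ngrams(word, min_gram=1, max_gram=10):
    """Prefixes of length min_gram..max_gram, as a stand-in for the edge-ngram filter."""
    return [word[:n] for n in range(min_gram, min(len(word), max_gram) + 1)]


grams = edge_ngrams("entrepreneurs", max_gram=10)
longest = grams[-1]
print(longest, len(longest))     # entreprene 10
print("entrepreneurs" in grams)  # False: the full 13-letter word is never indexed

# With a larger (assumed) max_gram the whole word would be indexed as well.
print("entrepreneurs" in edge_ngrams("entrepreneurs", max_gram=20))  # True
```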

hth

clint

--

Sincerely appreciate the explanations given and the examples. Got to
understand things better! Thanks, this works like a charm!

--

Hi, your post was very useful.
I wonder whether ngram highlighting also applies to multi-word queries,
for example "samsung gala" matched against "samsung galaxy".
I did exactly as shown, but it picked up only one word.

Thanks!

--