Issues with elasticsearch + autocomplete

Niranjan · August 16, 2012, 3:08pm

Hello everyone,

I have implemented an autocomplete feature for my search input box and am
using an ngram filter for this.
My index_analyzer is ngram and search_analyzer is standard. Along with
this, I also use highlight, so that the matches are highlighted in a
different color in the autocomplete.

I have this problem. I have these 2 sentences I'm searching against - "Play
baseball", "Play tennis". When I type "P", "Pl", "Pla", the 2 sentences get
matched. But when I type "Play", I get 0 matches. If I continue typing to
"Play B", then onwards it rightly matches "Play baseball".

Why is this so?

Another problem: I have a word I'm searching against, "Trekking". When I
type "trekking", it rightly matches the word, but only 'trek' is
highlighted and 'ing' is left unhighlighted.
For example, when I type 'trek', 'trekk', 'trekki', 'trekkin', the word is
rightly matched and the letters I have typed are rightly highlighted. When
I type 'trekking', only 'trek' is highlighted. Why is this? These issues
are driving me up the wall.

Regards

--

Clinton_Gormley · August 16, 2012, 7:28pm

Hi Niranjan

It's much better to gist a curl recreation of the problem so that we can
see exactly what you're doing, and what you could change. With the info
we have so far, we'd just be guessing.

See the Elasticsearch help page for advice about how to gist a
recreation.

clint

--

Niranjan · August 16, 2012, 7:50pm

Hi Clinton,

Thanks for the tip.
My gist recreation is here: gist:3372990 · GitHub
My mapping is here: gist:3373031 · GitHub

Regards,
Niranjan

--

Clinton_Gormley · August 16, 2012, 8:40pm

On Fri, 2012-08-17 at 01:20 +0530, Niranjan U wrote:

Hi Clinton,

Thanks for the tip.
My gist recreation is here: gist:3372990 · GitHub

Ummm... You've gisted your cluster health...

When I say "gist a curl recreation" I mean something that we can copy
and paste and so recreate your problem

clint

--

dadoonet · August 16, 2012, 8:52pm

Hi Clint.

Problem is that there is an example in help page: curl -XGET 'http://127.0.0.1:9200/_cluster/health?pretty=true'

Some users think that they need to give the cluster health.

We should enhance this part of the help page and tell users exactly what you wrote: "something that we can copy and paste and so recreate your problem"

I think we should add a link to a Gist example with full script:
Delete index
Create index
Index some docs
Refresh index
Query

What do you think? I can work on that.
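
The five steps listed above could be scripted roughly as follows. This is only a sketch: it prints the curl commands rather than running them, and the index name, type name, field name and sample documents are borrowed from elsewhere in this thread. The helper names `curl` and `recreation_script` are made up for illustration, and the empty index body is a placeholder for whatever settings and mappings are being tested.

```python
import json


def curl(method, url, body=None):
    """Format one curl command line; a body, if given, is sent with -d."""
    command = f"curl -X{method} '{url}'"
    if body is not None:
        # The JSON is wrapped in single quotes for the shell, so keep
        # apostrophes out of the sample data.
        command += f" -d '{json.dumps(body)}'"
    return command


def recreation_script(base, index, doc_type, index_body, docs, query):
    """Build a copy-and-paste recreation: delete, create, index, refresh, query."""
    idx = f"{base}/{index}"
    lines = [
        "# 1. Delete the index so the script can be re-run from scratch",
        curl("DELETE", idx),
        "# 2. Create the index with the settings and mappings under test",
        curl("PUT", idx, index_body),
        "# 3. Index some docs",
    ]
    for doc_id, doc in enumerate(docs, start=1):
        lines.append(curl("PUT", f"{idx}/{doc_type}/{doc_id}", doc))
    lines += [
        "# 4. Refresh the index so the docs are visible to search",
        curl("POST", f"{idx}/_refresh"),
        "# 5. Run the query that shows the problem",
        curl("GET", f"{idx}/{doc_type}/_search?pretty=1", query),
    ]
    return "\n".join(lines)


script = recreation_script(
    base="http://127.0.0.1:9200",
    index="test",
    doc_type="interests",
    index_body={},  # placeholder: put the real settings and mappings here
    docs=[{"name": "Play baseball"}, {"name": "Play tennis"}],
    query={"query": {"text": {"name": "Play"}}},
)
print(script)
```

Pasting the printed output into a gist gives readers something they can run as-is against a local node.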

David

--

Clinton_Gormley · August 16, 2012, 9:01pm

Hi David

Problem is that there is an example in help page: curl -XGET
'http://127.0.0.1:9200/_cluster/health?pretty=true'

Some users think that they need to give the cluster health.

heh ok

I wrote that page a long time ago, and haven't looked at it since. I've
just assumed that I expressed myself clearly the first time. Clearly
not

We should enhance this part of the help page and tell users exactly
what you wrote: "something that we can copy and paste and so recreate
your problem"

Indeed!

I think we should add a link to a Gist example with full script:
Delete index
Create index
Index some docs
Refresh index
Query

What do you think? I can work on that.

That'd be great. thanks

clint

--

Niranjan · August 16, 2012, 11:09pm

Hi Clinton,

Sorry about that. I am sending gist of some of the curl operations. Hope
this will shed more light on the issue I am facing.

Firstly, here is the gist of the mapping I am using:
gist:3373031 · GitHub
As you can see, my index type "interests" has ngram as my
index_analyzer and "standard" as search_analyzer. I am also adding porter
stem filter to "my_ngram" analyzer to be able to search over stem words.

Here are some gist recreations of the issues I am facing:

(a) When I search for "trek" with highlighting enabled, this is the
result: gist:3374311 · GitHub
"trekking" gets highlighted, when in fact only "trek" should be
highlighted, that being my input. Thus, my autocomplete input box has
mismatched highlighting.

(b) When I search for "bas", this is the result:
gist:3374333 · GitHub
Check the highlight result: "basketba" is getting highlighted. Again,
my autocomplete is getting mismatched highlighting.

(c) Now this one is really weird. I am typing the word "entrepreneurs"
in my text box.
When I type "entrep", I rightly get the result with the highlight
working as expected: gist:3374351 · GitHub
However, when I type "entreprene", it still shows the result but the
highlight doesn't work on the text: gist:3374359 · GitHub
Now, the weird part is, when I type the full word "entrepreneurs", I get
0 results!! gist:3374366 · GitHub

I am totally stumped by the kind of results I am getting, any suggestions
please?

Regards,
Niranjan

--

Clinton_Gormley · August 17, 2012, 9:36am

Hi Niranjan

Sorry about that. I am sending gist of some of the curl operations.
Hope this will shed more light on the issue I am facing.

Much better!

Firstly, here is the gist of the mapping I am using:
gist:3373031 · GitHub
As you can see, my index type "interests" has ngram as my
index_analyzer and "standard" as search_analyzer. I am also adding
porter stem filter to "my_ngram" analyzer to be able to search over
stem words

OK - there are several things wrong with this mapping:

You are using a blunderbuss approach - trying to make EVERYTHING
autocomplete and stemmed and so on. Rather enable these things
selectively, where you really need them.

Rather use edge-ngrams instead of ngrams. People expect
'tre' to match 'trekking', not 'contretemps'.

There is no need to combine porter-stem with ngrams.
Porter-stem may convert (eg) 'camping' to 'camp'.
Edge-ngrams would give you:
c, ca, cam, camp, campi, campin, camping
Combined with stemming, you'd just get:
c, ca, cam, camp
I think that's where your 'entrepreneurs' search is going wrong.

Use the same analyzer at search and index time, otherwise there
is a good chance that you'll be searching for stuff which just
isn't there! Also, different analyzers order tokens differently,
which can affect results.
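
To make the points above concrete, here is a small pure-Python simulation of the token streams involved. It is a sketch, not Elasticsearch's implementation: `ngrams`, `edge_ngrams` and `toy_stem` are made-up helpers, and `toy_stem` only strips a trailing "ing" as a stand-in for a real Porter stemmer.

```python
def ngrams(token, min_gram, max_gram):
    """All substrings of token whose length lies in [min_gram, max_gram]."""
    return {
        token[start:start + size]
        for size in range(min_gram, max_gram + 1)
        for start in range(len(token) - size + 1)
    }


def edge_ngrams(token, min_gram, max_gram):
    """Only the prefixes of token whose length lies in [min_gram, max_gram]."""
    return [token[:size] for size in range(min_gram, min(max_gram, len(token)) + 1)]


def toy_stem(token):
    """Stand-in for a stemmer (NOT Porter): strips a trailing 'ing'."""
    return token[:-3] if token.endswith("ing") and len(token) > 5 else token


# Plain ngrams index every substring, so 'tre' is found inside 'contretemps'.
assert "tre" in ngrams("contretemps", 1, 10)
# Edge-ngrams index prefixes only, so 'tre' matches 'trekking' but not 'contretemps'.
assert "tre" in edge_ngrams("trekking", 1, 10)
assert "tre" not in edge_ngrams("contretemps", 1, 10)

plain = edge_ngrams("camping", 1, 10)
stemmed = edge_ngrams(toy_stem("camping"), 1, 10)
print(plain)    # ['c', 'ca', 'cam', 'camp', 'campi', 'campin', 'camping']
print(stemmed)  # ['c', 'ca', 'cam', 'camp']

# If the search side does not stem, the full word is looked up as typed,
# and that token was never indexed.
assert "camping" not in stemmed
```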

A better mapping is gist:992e0e704e035e8a1770 · GitHub

Note: I'm using a multi-field for "name" with two versions: one analyzed
with the 'english' analyzer, and one with edge_ngrams.
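
The gist itself is not reproduced in this thread, so the following is only a guess at the general shape of such a mapping, written as a Python dict that can be dumped to JSON. It assumes the `multi_field` type and the `edgeNGram` token filter as named in Elasticsearch releases of that period. The analyzer and filter name `edge_ngram`, the `standard` tokenizer, the `lowercase` filter and `min_gram` of 1 are illustrative assumptions, while `max_gram` of 10 follows the value mentioned later in this post. Check everything against the version you run.

```python
import json

index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # Assumed filter definition: prefixes of 1 to 10 characters.
                "edge_ngram": {"type": "edgeNGram", "min_gram": 1, "max_gram": 10}
            },
            "analyzer": {
                # Assumed custom analyzer, used at both index and search time.
                "edge_ngram": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "edge_ngram"],
                }
            },
        }
    },
    "mappings": {
        "interests": {
            "properties": {
                "name": {
                    "type": "multi_field",
                    "fields": {
                        # 'name' (or 'name.name'): stemmed full words.
                        "name": {"type": "string", "analyzer": "english"},
                        # 'name.ngrams': partial words for autocomplete.
                        "ngrams": {"type": "string", "analyzer": "edge_ngram"},
                    },
                }
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```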

The basic query looks like this:

curl -XGET 'http://127.0.0.1:9200/test/interests/_search?pretty=1' -d '
{
   "query" : {
      "bool" : {
         "should" : [
            {
               "text" : {
                  "name.ngrams" : {
                     "operator" : "and",
                     "query" : "SEARCH TERMS"
                  }
               }
            },
            {
               "text" : {
                  "name" : "SEARCH TERMS"
               }
            }
         ]
      }
   },
   "highlight" : {
      "fields" : {
         "name.ngrams" : {},
         "name" : {}
      }
   }
}
'

I'm querying the 'name.ngrams' field (which will get the partial words)
and I'm also querying the 'name' (or 'name.name') field which will match
any full words, and increase their relevance.
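
For anyone calling this from application code, the same request body can be built from the user's input. This is a minimal sketch: the function name `autocomplete_query` is made up, and it keeps the `text` query exactly as used above (later Elasticsearch releases name this query `match`).

```python
import json


def autocomplete_query(terms):
    """Build the request body shown above for the given search terms."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Partial words: every gram of the input must be present.
                    {"text": {"name.ngrams": {"operator": "and", "query": terms}}},
                    # Full (stemmed) words: boosts docs that match whole words.
                    {"text": {"name": terms}},
                ]
            }
        },
        "highlight": {"fields": {"name.ngrams": {}, "name": {}}},
    }


print(json.dumps(autocomplete_query("trek"), indent=2))
```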

Note: I'm using the 'and' operator for the ngrams, otherwise 'tre' will
match anything that contains just 't'.
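
The effect of the operator can be simulated in a few lines of Python. This is a toy model, not the real query execution: `edge_ngrams`, `analyze` and `search` are made-up helpers, and it assumes the query text goes through the same edge-ngram analysis as the indexed text, so 'tre' becomes the grams t, tr, tre.

```python
def edge_ngrams(token, min_gram=1, max_gram=10):
    """Prefixes of token whose length lies in [min_gram, max_gram]."""
    return [token[:size] for size in range(min_gram, min(max_gram, len(token)) + 1)]


def analyze(text):
    """Toy analyzer: lowercase, split on whitespace, edge-ngram each word."""
    return [gram for word in text.lower().split() for gram in edge_ngrams(word)]


def search(query, docs, operator):
    """Return docs containing all ("and") or any ("or") of the query's grams."""
    combine = all if operator == "and" else any
    query_grams = analyze(query)
    hits = []
    for doc in docs:
        doc_grams = set(analyze(doc))
        if combine(gram in doc_grams for gram in query_grams):
            hits.append(doc)
    return hits


docs = ["trekking", "play tennis", "play basketball"]
print(search("tre", docs, "or"))   # ['trekking', 'play tennis']
print(search("tre", docs, "and"))  # ['trekking']
```

With "or", 'play tennis' comes back because 'tennis' starts with 't'. With "and", all of t, tr and tre must be present, which leaves only 'trekking'.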

I'm highlighting both fields.

Here are some results:

For 'tre':
"name.ngrams" : "<em>tre</em>kking"

For 'trek':

   "name.ngrams" : "<em>trek</em>king"
   "name"        : "<em>trekking</em>"

# note 'trekking' has been highlighted because the stemmed term
# is 'trek'

For 'bas':
"name.ngrams" : "play <em>bas</em>ketball"

For 'basketbal':

   "name.ngrams" : "play <em>basketbal</em>l"
   "name"        : "play <em>basketball</em>"

For 'entrepreneurs':

   "name.ngrams" : "meet <em>entreprene</em>urs"
   "name"        : "meet <em>entrepreneurs</em>"

   # note how the ngram match stops at 10 letters - that's because
   # your max_gram was set to 10.
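
The 10-letter cutoff can be reproduced with a toy highlighter. This is a simulation under stated assumptions, not how Elasticsearch's highlighter is implemented: `edge_ngrams` and `highlight` are made-up helpers, and `max_gram` defaults to 10 as in the note above.

```python
def edge_ngrams(token, min_gram=1, max_gram=10):
    """Prefixes of token whose length lies in [min_gram, max_gram]."""
    return [token[:size] for size in range(min_gram, min(max_gram, len(token)) + 1)]


def highlight(word, query, max_gram=10):
    """Wrap the longest gram shared by word and query in <em> tags."""
    indexed = set(edge_ngrams(word, max_gram=max_gram))
    shared = [gram for gram in edge_ngrams(query, max_gram=max_gram) if gram in indexed]
    if not shared:
        return word
    longest = max(shared, key=len)
    return f"<em>{longest}</em>{word[len(longest):]}"


print(highlight("trekking", "trek"))                 # <em>trek</em>king
print(highlight("basketball", "basketbal"))          # <em>basketbal</em>l
print(highlight("entrepreneurs", "entrepreneurs"))   # <em>entreprene</em>urs

# Separately: if the search side emits the whole 13-letter word as a single
# token, it cannot equal any indexed gram of at most 10 letters.
assert "entrepreneurs" not in edge_ngrams("entrepreneurs")
```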

hth

clint

--

Niranjan · August 18, 2012, 1:04pm

Sincerely appreciate the explanations given and the examples. Got to
understand things better! Thanks, this works like a charm!

--

adhenawer · September 14, 2012, 7:49pm

Hi, your post was very useful.
I wonder whether the highlighting of ngrams can be applied to multi-word
input, for example "samsung gala" from "samsung galaxy".
I did exactly as shown, but it picked up only one word.

Thanks!

