Highlighting not working for [edge_]ngram with the new versions

Guillermo_Arias_del_ · November 19, 2013, 10:52am

Hi, all!

We recently updated from 0.90.1 to 0.90.6 and our highlighting tests began
to fail. Updating to 0.90.7 didn't work, so I think there is a bug, or at
least something changed in the specification...

I added a Gist to reproduce the problem:

https://gist.github.com/ariasdelrio/7543562

curl -s -X DELETE "http://localhost:9200/test_highlight" > /dev/null
curl -s -X PUT "http://localhost:9200/test_highlight" -d '
{
  "settings" : {
    "index": {
      "number_of_shards" : 1,
      "analysis": {
        "analyzer": {
          "icu_analyzer" : {
             "type": "custom",

(The gist preview is truncated here; the full file is at the link above.)

Here, I am searching for "te" in a field containing "some text to
highlight". The field is folded and tokenized using the ICU plug-in. I
expect only the "te" in "text" to be highlighted (which is what I got in
earlier versions), but the returned value has the whole words "text" and
"to" highlighted. I checked with all three highlighters (I only need the
postings highlighter, but I thought I should check the others as well).

Is this a bug or am I missing something?

Thanks,
Guillermo.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Guillermo_Arias_del_ · November 19, 2013, 1:18pm

I found an answer to this:

elasticsearch highlights entire word instead of just the query when ngram filter is used · Issue #3137 · elastic/elasticsearch · GitHub

Opened 03:32AM, 05 Jun 13 UTC by ike-bloomfire; closed 06:52AM, 06 Jun 13 UTC.

When using an nGram filter on a field or index, if you try to highlight said field (or a field in an index that has an nGram filter defined on it) in search results, elasticsearch highlights the entire word instead of just the query. So if I have the text "American" and I search for "rican", highlighting should look like this:

Ame **rican**

but instead it does this:

**American**

To see this in action, just follow the instructions here: http://stackoverflow.com/a/15005321/141822

You get this output, which is clearly wrong:

```
{
  "took" : 11,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 },
  "hits" : {
    "total" : 3,
    "max_score" : 0.71231794,
    "hits" : [ {
      "_index" : "myindex",
      "_type" : "product",
      "_id" : "0KyaIB8xRmqE-g0hl0ky6g",
      "_score" : 0.71231794,
      "fields" : { "code" : "Samsung Galaxy i7500" },
      "highlight" : {
        "code.ngram" : [ "Samsung Galaxy i7500" ],
        "code" : [ "Samsung Galaxy i7500" ]
      }
    }, {
      "_index" : "myindex",
      "_type" : "product",
      "_id" : "vZwpcBu0QAyGmP9LHz1hUA",
      "_score" : 0.71231794,
      "fields" : { "code" : "Samsung Galaxy 5 Europa" },
      "highlight" : {
        "code.ngram" : [ "Samsung Galaxy 5 Europa" ],
        "code" : [ "Samsung Galaxy 5 Europa" ]
      }
    }, {
      "_index" : "myindex",
      "_type" : "product",
      "_id" : "7sNkZAlxSlmuLZA9S68bvg",
      "_score" : 0.71231794,
      "fields" : { "code" : "Samsung Galaxy Mini" },
      "highlight" : {
        "code.ngram" : [ "Samsung Galaxy Mini" ],
        "code" : [ "Samsung Galaxy Mini" ]
      }
    } ]
  }
}
```

With the whitespace tokenizer (vs. the keyword tokenizer in this case), it highlights just the word with the match in it, which is still not the expected behavior.

For now, I'm setting "version": "4.1", though it may cause problems, as the
post says. Also, I forgot to specify a different "search_analyzer" in my
Gist, which explains the match against "to": I was trying to reduce the
example to a minimal configuration and cut that part out.

I think, however, that the documentation should have a few lines explaining
what you can and cannot expect from highlighting, because it can drive you
crazy.

javanna · November 20, 2013, 12:58pm

Hi Guillermo,
are you sure highlighting was working properly with the same mapping before?

What I see in your gist is that you use the edge ngram filter with its
defaults, which unfortunately are min_gram 1 and max_gram 2. Have a look at
the analyze API output to see what you are indexing:
http://localhost:9200/test_highlight/_analyze?analyzer=icu_analyzer&text=some%20text%20to%20highlight

"Some text to highlight" becomes s, t, t, h. At query time, te becomes t
as well (since you apply the same analyzer at search time too), which is why
the second and third tokens get highlighted. That makes sense to me.
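
To make the matching concrete, here is a rough Python sketch of the reasoning above. It is not Elasticsearch code: it assumes a plain lowercase-and-whitespace split in place of the ICU tokenizer and folding from the gist, uses the min_gram 1 / max_gram 2 values mentioned above, and all function names are hypothetical.

```python
def edge_ngrams(token, min_gram=1, max_gram=2):
    """Return the leading substrings of token with lengths min_gram..max_gram."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_analyze(text, min_gram=1, max_gram=2):
    """Simplified stand-in for the index analyzer (assumption: lowercase,
    whitespace split, then edge n-grams per word)."""
    return [(word, edge_ngrams(word, min_gram, max_gram))
            for word in text.lower().split()]


def matched_words(text, query_terms, min_gram=1, max_gram=2):
    """Words of text that share at least one indexed gram with the query terms."""
    terms = set(query_terms)
    return [word
            for word, grams in index_analyze(text, min_gram, max_gram)
            if terms & set(grams)]


text = "some text to highlight"

# Same analyzer at search time: the query "te" is itself cut into edge n-grams,
# so the one-letter gram "t" also matches "to".
same_analyzer = matched_words(text, edge_ngrams("te"))

# Hypothetical search analyzer without the n-gram filter: "te" stays whole.
plain_search = matched_words(text, ["te"])

print(same_analyzer)
print(plain_search)
```

In this simplified model, analyzing the query with the same edge n-gram filter matches both "text" and "to", while leaving the query term whole matches only "text", which is consistent with the missing "search_analyzer" explaining the hit on "to".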
