Highlighting not working for [edge_]ngram with the new versions


(Guillermo Arias del Río) #1

Hi, all!

We recently updated from 0.90.1 to 0.90.6 and our highlighting tests began
to fail. Updating to 0.90.7 didn't help, so I think there is a bug, or at
least something changed in the specification...

I added a Gist to reproduce the problem:
https://gist.github.com/ariasdelrio/7543562

Here, I am searching for "te" in a field containing "some text to
highlight". The field is folded and tokenized using the ICU plug-in. The
value I expect (and got in earlier versions) is "some text to highlight",
but the returned value is "some text to highlight". I checked with the
three highlighters (I only need the postings highlighter, but I thought I
should check with the others as well).

Is this a bug or am I missing something?

Thanks,
Guillermo.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Guillermo Arias del Río) #2

I found an answer to this:
https://github.com/elasticsearch/elasticsearch/issues/3137

I am now using "version": "4.1", though it may cause problems, as the post
says. Also, I forgot to specify a different "search_analyzer" in my Gist,
which explains the match against "to". I was trying to get an example with
a minimal configuration and cut that part off :)
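
For anyone landing here later, this is roughly the shape of the configuration I mean, written as a Python dict so it can be dumped to JSON. Treat it as an unverified sketch: the filter name `prefix_filter`, the analyzer name `icu_search_analyzer`, the type name `doc`, the field name `text` and the gram sizes are all made up for illustration, and I am assuming that "version" goes on the edge ngram filter itself and that the field takes separate "index_analyzer" and "search_analyzer" entries.

```python
import json

# Hypothetical index body; names marked "hypothetical" are not from the Gist.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "prefix_filter": {          # hypothetical filter name
                    "type": "edgeNGram",
                    "min_gram": 1,          # illustrative values
                    "max_gram": 2,
                    "version": "4.1",       # assumption: set on the filter
                }
            },
            "analyzer": {
                # Index-time analyzer: ICU tokenizing and folding plus prefixes.
                "icu_analyzer": {
                    "type": "custom",
                    "tokenizer": "icu_tokenizer",
                    "filter": ["icu_folding", "prefix_filter"],
                },
                # Search-time analyzer (hypothetical name): same chain, no prefixes.
                "icu_search_analyzer": {
                    "type": "custom",
                    "tokenizer": "icu_tokenizer",
                    "filter": ["icu_folding"],
                },
            },
        }
    },
    "mappings": {
        "doc": {                            # hypothetical type name
            "properties": {
                "text": {                   # hypothetical field name
                    "type": "string",
                    "index_analyzer": "icu_analyzer",
                    "search_analyzer": "icu_search_analyzer",
                }
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```

The point is only that the query string is not expanded into prefixes at search time, so "te" is looked up as a single term.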

I think, however, that the documentation should have a few lines explaining
what you can and cannot expect from highlighting, because it can drive you
crazy.

On Tuesday, 19 November 2013 11:52:42 UTC+1, Guillermo Arias del Río
wrote:

Hi, all!

We recently updated from 0.90.1 to 0.90.6 and our highlighting tests began
to fail. Updating to 0.90.7 didn't work, so I think there is a bug, or at
least something changed in the specification...

I added a Gist to reproduce the problem:
https://gist.github.com/ariasdelrio/7543562

Here, I am searching for "te" in a field containing "some text to
highlight". The field is folded and tokenized using the ICU plug-in. The
value I expect (and got in earlier versions) is "some text to highlight",
but the returned value is "some text to highlight". I checked with the
three highlighters (I only need the postings highlighter, but I thought I
should check with the others as well).

Is this a bug or am I missing something?

Thanks,
Guillermo.



(Luca Cavanna) #3

Hi Guillermo,
are you sure highlighting was working properly with the same mapping before?

What I see in your gist is that you use the default edge ngram filter,
which unfortunately has min_gram 1 and max_gram 2. Have a look at the
analyze API output to see what you are indexing:
http://localhost:9200/test_highlight/_analyze?analyzer=icu_analyzer&text=some%20text%20to%20highlight

"Some text to highlight" becomes s, t, t, h. At query time, te becomes t
as well (since you apply the same analyzer at search time too), which is
why the second and third tokens get highlighted. That makes sense to me.
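
To make the overlap concrete, here is a small Python simulation of what happens to the terms. It is only a sketch: lowercasing and whitespace splitting stand in for the ICU tokenizer and folding, and the helper names (`edge_ngrams`, `analyze`, `matched_tokens`) are invented for this example. With max_gram 2 the sketch also emits two-character prefixes, not only single letters, but the shared term t is what produces both matches.

```python
def edge_ngrams(token, min_gram=1, max_gram=2):
    """Return the leading substrings of token with lengths min_gram..max_gram."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def analyze(text, use_ngrams=True):
    """Lowercase, split on whitespace, optionally expand tokens to edge n-grams.

    Returns a list of (original_token, term) pairs.
    """
    pairs = []
    for token in text.lower().split():
        terms = edge_ngrams(token) if use_ngrams else [token]
        pairs.extend((token, term) for term in terms)
    return pairs


def matched_tokens(field_text, query, ngrams_at_search):
    """Original tokens whose indexed terms overlap the analyzed query terms."""
    indexed = analyze(field_text, use_ngrams=True)
    query_terms = {term for _, term in analyze(query, use_ngrams=ngrams_at_search)}
    seen = []
    for token, term in indexed:
        if term in query_terms and token not in seen:
            seen.append(token)
    return seen


field = "some text to highlight"

# Same analyzer at index and search time: "te" is expanded to "t" and "te".
same_analyzer = matched_tokens(field, "te", ngrams_at_search=True)

# No n-grams at search time: "te" stays a single term.
plain_search = matched_tokens(field, "te", ngrams_at_search=False)

print(same_analyzer)
print(plain_search)
```

In this model the first call matches both "text" and "to", while the second matches only "text", which is the effect a separate search analyzer is meant to have.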

On Tuesday, November 19, 2013 2:18:38 PM UTC+1, Guillermo Arias del Río
wrote:

I found an answer to this:
https://github.com/elasticsearch/elasticsearch/issues/3137
I am now using "version": "4.1", though it may cause problems, as the
post says. Also, I forgot to specify a different "search_analyzer" in my
Gist, which explains the match against "to". I was trying to get an
example with a minimal configuration and cut that part off :)

I think, however, that the documentation should have a few lines
explaining what you can and cannot expect from highlighting, because it can
drive you crazy.


