Edge Ngram gives bad highlight when using position offsets

Sebastien_Lorber · January 1, 2013, 9:14pm

Hello,

I've posted a question on StackOverflow but nobody answered:

java - Elasticsearch - EdgeNgram + highlight + term_vector = bad highlights - Stack Overflow

Does anyone know why, when using position offsets, edge ngrams only
highlight the shortest matching token?

I would expect the longest matching token to be highlighted. With
edge ngrams and min size = 3, I get a highlight of 3 chars even though
there should be another matching token of 5 or 6 chars (for example).

Thanks

--

Sebastien_Lorber · January 3, 2013, 1:51pm

Nobody knows?

--

mvg · January 3, 2013, 3:17pm

Hi Sebastien,

What version of ES are you using, and are you also using the ngram
tokenizer at query time?
In your case the edge ngram should be applied only during index time.
Otherwise your query term gets expanded into cou and couc.
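
To make that expansion concrete, here is a minimal sketch in plain Python. It is a toy model of an edge n-gram filter, not Lucene's implementation: the function name `edge_ngrams` and the `max_gram=10` default are assumptions for illustration, while `min_gram=3` and the term "couc" come from this thread.

```python
def edge_ngrams(term, min_gram=3, max_gram=10):
    """Return the leading-edge n-grams of `term`, shortest first (toy model)."""
    longest = min(len(term), max_gram)
    return [term[:n] for n in range(min_gram, longest + 1)]


# Same analyzer at query time: the query term itself is split into prefixes.
query_with_edge_ngram = edge_ngrams("couc")

# Edge n-grams at index time only: the query term stays whole.
query_without_edge_ngram = ["couc"]

print(query_with_edge_ngram)
print(query_without_edge_ngram)
```

With the filter applied only at index time, the single query term "couc" can still match the indexed prefix "couc", and no shorter prefix takes part in the query.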

With the latest ES version (0.20.2) the highlighting on a field that
has the edge ngram filter configured works as expected:

https://gist.github.com/martijnvg/4444187

11303660.sh (truncated here; the full script is in the gist):

curl -XDELETE 'localhost:9200/test'
echo

curl -XPUT 'localhost:9200/test?pretty' -d '{
  "mappings" : {
    "test" : {
      "properties" : {
        "text" : {
          "type" : "string",
          "term_vector" : "with_positions_offsets",

Martijn

--
Kind regards,

Martijn van Groningen

--

Sebastien_Lorber · January 4, 2013, 11:27pm

Hi,

By mistake I sent a private message so for people interested here is the
conversation:

Thanks

I guess the problem is that I read almost everywhere that one should always
use the same analyzers at index and query time.
I'm not a search expert, so I followed this advice.

But yes, it seems right that this is caused by the search term being
tokenized into "cou" and "couc" and other tokens...

So in my index there are "cou" and "couc" and other tokens...
And my query is tokenized into "cou" and "couc" and other tokens...

Wouldn't it make sense for the longest matching token to be chosen?
My intuition tells me that the "positions offsets" are kind of attached to
the indexed tokens, no?
So if the match happens on many tokens, it would be nice to:

- Not have a biased scoring because of the number of tokens
- Keep the longest token so that my highlight is based on the position
  offsets of "couc" and not "cou"

Perhaps there is an option to do that?

Answer:

In the case of EdgeNGram it is best to apply it only at index time
(e.g. for autocomplete-like functionality).

I think the fast vector highlighter just prefers the shortest match if
it occurs on the same location. Each token does have an offset
stored in the term vector inside the index, but I'm not sure if what
you want is possible with fast vector highlighting.
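
The effect described above can be sketched with a toy model in plain Python. Everything here is an assumption for illustration: the helper names, the sample text "red couch", `max_gram=10`, and the idea that each gram carries the offsets of its own prefix (which is what the 3-char highlight in this thread suggests). It does not reproduce the fast vector highlighter's real algorithm; it only contrasts picking the shortest and the longest match at the same location.

```python
def edge_ngrams_with_offsets(text, min_gram=3, max_gram=10):
    """Return (token, start, end) for the edge n-grams of each word (toy model)."""
    tokens = []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        longest = min(len(word), max_gram)
        for n in range(min_gram, longest + 1):
            # Assumption: each gram's offsets cover only the gram itself.
            tokens.append((word[:n].lower(), start, start + n))
        pos = start + len(word)
    return tokens


def highlight(text, span):
    """Wrap the characters in `span` (start, end) with <em> tags."""
    start, end = span
    return text[:start] + "<em>" + text[start:end] + "</em>" + text[end:]


text = "red couch"                      # hypothetical indexed document
indexed = edge_ngrams_with_offsets(text)
query_terms = {"cou", "couc"}           # the expanded query from this thread

# Every indexed gram that matches one of the query terms.
matches = [(s, e) for tok, s, e in indexed if tok in query_terms]

shortest = min(matches, key=lambda m: m[1] - m[0])
longest = max(matches, key=lambda m: m[1] - m[0])

print(highlight(text, shortest))
print(highlight(text, longest))
```

Both matches start at the same offset, so whichever one the highlighter keeps decides whether 3 or 4 characters end up wrapped.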

For the normal ngram token filter it usually does make sense to have
it configured at both query and index time.

Martijn

--
