Edge Ngram gives bad highlight when using position offsets

Hello,

I posted a question on StackOverflow, but nobody has answered it:

java - Elasticsearch - EdgeNgram + highlight + term_vector = bad highlights - Stack Overflow

Does anyone know why, when using position offsets, edge ngrams only
highlight the shortest matching token?

I would expect the longest matching token to be highlighted. With edge
ngrams and a min size of 3, I get a highlight of 3 chars even though
there should be another matching token of 5 or 6 chars (for example).

Thanks

--

Nobody knows?

--

Hi Sebastien,

What version of ES are you using, and are you also applying the ngram
tokenizer at query time?
In your case the edge ngram should be applied at index time only.
Otherwise your query term gets expanded into cou and couc.
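
For readers who want to try the index-time-only setup, here is a minimal sketch of what the index settings and mapping might look like, written as a Python dict that can be dumped to JSON. The analyzer, filter, type and field names (autocomplete_index, autocomplete_search, edge_filter, doc, name) are made up for illustration, max_gram = 6 is an assumption, and the mapping keys follow the 0.20-era conventions as far as I know, so check them against the documentation for your version.

```python
import json

# Hypothetical names throughout: "edge_filter", "autocomplete_index",
# "autocomplete_search", "doc" and "name" are not from the original thread.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # min_gram = 3 comes from the question; max_gram = 6 is assumed.
                "edge_filter": {"type": "edgeNGram", "min_gram": 3, "max_gram": 6}
            },
            "analyzer": {
                # Index time: lowercase, then emit edge n-grams.
                "autocomplete_index": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "edge_filter"],
                },
                # Query time: same chain, but without the edge n-gram filter,
                # so the query term is not expanded into "cou", "couc", ...
                "autocomplete_search": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                },
            },
        }
    },
    "mappings": {
        "doc": {
            "properties": {
                "name": {
                    "type": "string",
                    "index_analyzer": "autocomplete_index",
                    "search_analyzer": "autocomplete_search",
                    # Store positions and offsets for term-vector based highlighting.
                    "term_vector": "with_positions_offsets",
                }
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```

The only point of the sketch is that the field uses two different analyzers: the edge ngram filter appears in the index-time chain and not in the query-time chain.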

With the latest ES version (0.20.2) the highlighting on a field that
has the edge ngram filter configured works as expected:

Martijn

--
Kind regards,

Martijn van Groningen

--

Hi,

I sent a private message by mistake, so for anyone interested, here is
the conversation:


Thanks

I guess the problem is that I read almost everywhere that one should
always use the same analyzers at index and query time.
I'm not a search expert, so I followed this advice.

But yes it seems right that it is because of the search term tokenization
into "cou" and "couc" and other tokens...

So in my index there are "cou" and "couc" and other tokens...
And my query is tokenized into "cou" and "couc" and other tokens...
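
To make the double tokenization concrete, here is a small Python sketch that imitates what an edge ngram filter produces. It is not the Lucene implementation, only the prefix logic. The indexed word "couch", the query "couc" and max_gram = 6 are assumptions chosen to match the "cou" / "couc" tokens mentioned above.

```python
def edge_ngrams(term, min_gram=3, max_gram=6):
    """Return the prefixes of `term` with lengths min_gram..max_gram."""
    upper = min(len(term), max_gram)
    return [term[:n] for n in range(min_gram, upper + 1)]


# Assumed example data: the word in the index and the user's query.
indexed = edge_ngrams("couch")  # tokens stored at index time
queried = edge_ngrams("couc")   # tokens produced if the query uses the same analyzer

print(indexed)  # ['cou', 'couc', 'couch']
print(queried)  # ['cou', 'couc']

# Every query token also exists in the index, so the same word is
# matched by several tokens instead of one.
shared = [t for t in queried if t in indexed]
print(shared)   # ['cou', 'couc']
```

With a query analyzer that does not apply the edge ngram, the query would stay a single token ("couc") and match exactly one indexed token.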

Wouldn't it make sense for the largest matching token to be chosen?
My intuition tells me that the "position offsets" are attached to the
indexed tokens, no?
So if the match happens on many tokens, it would be nice to:

  • Not have the scoring biased by the number of tokens
  • Keep the largest token so that my highlight is based on the position
    offsets of "couc" and not "cou"

Perhaps there is an option to do that?

Answer:

In the case of EdgeNGram it is best to apply it at index time only
(e.g. for autocomplete-like functionality).

I think the fast vector highlighter just prefers the shortest match when
several matches occur at the same location. Each token does have an
offset stored in the term vector inside the index, but I'm not sure if
what you want is possible with fast vector highlighting.
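
The difference between the two behaviours can be illustrated with a toy model in Python. This is not the fast vector highlighter's actual algorithm, just a sketch of "shortest match wins" versus "longest match wins". It assumes each edge ngram is stored with its own start and end offsets; the text "red couch" and the query tokens are made up for the example.

```python
from collections import namedtuple

# A token with character offsets, similar in spirit to a term vector entry.
Token = namedtuple("Token", "text start end")


def edge_ngram_tokens(word, word_start, min_gram=3, max_gram=6):
    """Edge n-grams of `word`, each with its own (start, end) offsets (assumed)."""
    upper = min(len(word), max_gram)
    return [Token(word[:n], word_start, word_start + n)
            for n in range(min_gram, upper + 1)]


def highlight(text, tokens, query_terms, prefer):
    """Wrap one matching token in <em> tags; `prefer` is min or max by length."""
    matches = [t for t in tokens if t.text in query_terms]
    if not matches:
        return text
    chosen = prefer(matches, key=lambda t: t.end - t.start)
    return (text[:chosen.start] + "<em>" + text[chosen.start:chosen.end]
            + "</em>" + text[chosen.end:])


text = "red couch"
tokens = edge_ngram_tokens("couch", text.index("couch"))
query_terms = {"cou", "couc"}  # query analyzed with the edge ngram too

print(highlight(text, tokens, query_terms, min))  # red <em>cou</em>ch
print(highlight(text, tokens, query_terms, max))  # red <em>couc</em>h
```

The first line corresponds to the behaviour reported in this thread, the second to what was expected. If the query is not ngrammed, only one token matches and the choice never comes up.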

For the normal ngram token filter it usually does make sense to have
it configured at both query and index time.

Martijn
