Hi,
By mistake I sent a private message, so for people interested, here is the
conversation:
Thanks
I guess the problem is that I read almost everywhere to always use the same
analyzers at index and query time.
I'm not a search expert, so I followed this advice.
But yes, it seems right that it is caused by the search term being tokenized
into "cou", "couc" and other tokens...
So my index contains "cou", "couc" and the other edge ngrams, and my query is
tokenized into the same "cou", "couc" and other tokens...
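To illustrate, the analysis gives roughly this (a sketch: the index and
analyzer names are made up, the request syntax assumes a recent Elasticsearch
version, and I assume an edge ngram filter with min_gram=3):

curl -XGET 'localhost:9200/my_index/_analyze' -H 'Content-Type: application/json' -d '{
  "analyzer": "my_edgengram_analyzer",
  "text": "coucou"
}'

# Returns the tokens "cou", "couc", "couco" and "coucou",
# all derived from the single input word.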
Wouldn't it make sense that the largest matching token is chosen?
My intuition tells me that the "position offsets" are kind of attached to the
indexed tokens, no?
So if the match happens on many tokens, it would be nice to:
- Not have the scoring biased by the number of matching tokens
- Keep the largest token, so that my highlight is based on the position
offsets of "couc" and not "cou"
Perhaps there is an option to do that?
Answer:
In the case of EdgeNGram it is best to only apply it at index time
(e.g. for autocomplete-like functionality).
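Something like this mapping sketch shows the idea (the index, field and
analyzer names and the gram sizes are made up; the syntax assumes a recent
Elasticsearch version, the 0.90-era field options were index_analyzer and
search_analyzer instead):

curl -XPUT 'localhost:9200/my_index' -H 'Content-Type: application/json' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "my_edgengram_filter": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 10
        }
      },
      "analyzer": {
        "my_edgengram_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_edgengram_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_edgengram_analyzer",
        "search_analyzer": "standard",
        "term_vector": "with_positions_offsets"
      }
    }
  }
}'

The edge ngrams are only generated at index time; the query text goes through
the plain standard analyzer, so a search for "coucou" matches the indexed
gram "coucou" directly, and with term_vector set to with_positions_offsets
the fast vector highlighter can still be used.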
I think the fast vector highlighter just prefers the shortest match if
it occurs at the same position. Each token does have an offset
stored in the term vector inside the index, but I'm not sure whether what
you want is possible with fast vector highlighting.
For the normal ngram token filter it usually does make sense to have it
configured at both query and index time.
Martijn
On Tuesday, January 1, 2013 10:14:17 PM UTC+1, Sébastien Lorber wrote:
Hello,
I've posted a question on StackOverflow but nobody answered:
java - Elasticsearch - EdgeNgram + highlight + term_vector = bad highlights - Stack Overflow
Does someone know why, when using position offsets, edge ngrams only
highlight the smallest matching token?
I would expect it to highlight the longest matching token: with edge ngrams
of min size = 3, I get a highlight of 3 chars while there should be another
matching token of 5 or 6 chars (for example).
Thanks