Hey Elasticsearch, hopefully someone can at least explain whether this is
intentional and how it happens (I have had other fragment highlighting
issues not unlike this).
The problem seems simple: I have a 64-character string that I generate 62
tokens for. Whenever I search for the entire string, the highlight ends up
applied to the 50th fragment instead of the one that actually most nearly
matches my search query.
Also confusing: if I try a very similar search, using an exact match on the
SHA1 or MD5 attributes, highlighting works like I'd expect it to.
Hi Elasticsearch, I'm still waiting to hear whether this is a known issue
(possibly one that's resolved in a future release) or something I did. I'd
appreciate knowing, at least, if anyone can help. Thanks much.
On Friday, March 14, 2014 5:29:10 PM UTC-4, Jon-Paul Lussier wrote:
> [original post quoted in full; see above]
I can confirm this issue is reproducible in the 1.0.1 release.
This causes one match per character. The plain highlighter will combine
these matches because they overlap (b, ba, ba0, ba0d, etc.), but only the
first 50, so your match is cut off at 50 characters. Why 50? Because:
private static final int MAX_NUM_TOKENS_PER_GROUP = 50;
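To make the cutoff concrete, here is a rough Python model of the behaviour described above. It is only a sketch, not Lucene's actual code: it assumes the field is analyzed into edge n-grams anchored at offset 0 with a minimum length of 3 (a guess that happens to give 62 tokens for a 64-character string), and that only the first 50 overlapping tokens may extend the merged highlight. The function names are made up for illustration.

```python
MAX_NUM_TOKENS_PER_GROUP = 50  # value of the constant quoted in the thread


def edge_ngram_offsets(text, min_gram=3):
    # Assumed analysis: edge n-grams anchored at offset 0.
    # min_gram=3 is a guess, not something stated in the thread.
    return [(0, end) for end in range(min_gram, len(text) + 1)]


def merged_highlight_span(offsets, cap=MAX_NUM_TOKENS_PER_GROUP):
    # Simplified model: overlapping tokens join one group, but only the
    # first `cap` tokens are allowed to extend the group's offsets.
    group_start, group_end = offsets[0]
    counted = 1
    for start, end in offsets[1:]:
        if start >= group_end:
            break  # token does not overlap; a new group would start here
        if counted < cap:
            group_start = min(group_start, start)
            group_end = max(group_end, end)
            counted += 1
    return group_start, group_end


text = "0123456789abcdef" * 4  # stand-in for a 64-character hex digest
offsets = edge_ngram_offsets(text)
span = merged_highlight_span(offsets)
print(len(offsets), span)
```

Under these assumptions the merged span stops at the end of the 50th token rather than at the end of the string, which is roughly the truncation reported in the original post.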
None of the highlighters do a great job with this kind of match, but the
postings highlighter at least highlights everything:
<strong class="highlight">ba0</strong><strong
class="highlight">d</strong><strong class="highlight">7</strong><strong
class="highlight">2</strong><strong class="highlight">2</strong><strong
class="highlight">f</strong><strong class="highlight">0</strong><strong
class="highlight">4</strong><strong class="highlight">9</strong><strong
class="highlight">3</strong><strong class="highlight">f</strong><strong
class="highlight">9</strong><strong class="highlight">8</strong><strong
class="highlight">6</strong><strong class="highlight">e</strong><strong
class="highlight">8</strong><strong class="highlight">d</strong><strong
class="highlight">3</strong><strong class="highlight">f</strong><strong
class="highlight">0</strong><strong class="highlight">1</strong><strong
class="highlight">2</strong><strong class="highlight">9</strong><strong
class="highlight">f</strong><strong class="highlight">a</strong><strong
class="highlight">e</strong><strong class="highlight">0</strong><strong
class="highlight">9</strong><strong class="highlight">1</strong><strong
class="highlight">9</strong><strong class="highlight">a</strong><strong
class="highlight">f</strong><strong class="highlight">0</strong><strong
class="highlight">5</strong><strong class="highlight">4</strong><strong
class="highlight">f</strong><strong class="highlight">7</strong><strong
class="highlight">1</strong><strong class="highlight">0</strong><strong
class="highlight">3</strong><strong class="highlight">6</strong><strong
class="highlight">0</strong><strong class="highlight">8</strong><strong
class="highlight">9</strong><strong class="highlight">f</strong><strong
class="highlight">d</strong><strong class="highlight">8</strong><strong
class="highlight">1</strong><strong class="highlight">b</strong><strong
class="highlight">c</strong><strong class="highlight">d</strong><strong
class="highlight">1</strong><strong class="highlight">3</strong><strong
class="highlight">8</strong><strong class="highlight">d</strong><strong
class="highlight">6</strong><strong class="highlight">3</strong><strong
class="highlight">9</strong><strong class="highlight">e</strong><strong
class="highlight">0</strong><strong class="highlight">9</strong><strong
class="highlight">d</strong>
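If the one-tag-per-character output is the main annoyance, one possible application-side cleanup is to collapse directly adjacent highlight tags into a single tag. The Python sketch below assumes each fragment is closed with `</strong>` and that the opening tag is exactly `<strong class="highlight">`. The helper name is hypothetical and nothing here is part of Elasticsearch itself.

```python
# A closing highlight tag immediately followed by an opening one.
ADJACENT_TAGS = '</strong><strong class="highlight">'


def collapse_adjacent_highlights(fragment):
    # Remove back-to-back close/open pairs so that runs of neighbouring
    # highlights become one highlighted span. Tags separated by any text
    # (including whitespace) are left untouched.
    return fragment.replace(ADJACENT_TAGS, "")


sample = (
    '<strong class="highlight">ba0</strong>'
    '<strong class="highlight">d</strong>'
    '<strong class="highlight">7</strong>'
)
print(collapse_adjacent_highlights(sample))
```

This only tidies the markup; it does not change which characters the highlighter chose to mark.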
So, your options:
1. Live with it.
2. Turn on the postings highlighter by indexing the sha256 field with
   index_options set to offsets.
3. Wait until something else comes along.
4. Maybe an application level hack based on the string lengths.
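For the postings highlighter option, the pieces might look roughly like the following, written as Python dicts that serialize to the JSON request bodies. This is a sketch for the 1.x-era mapping syntax: the `sha256` field name comes from the thread, while the `file` type name and the sample digest are placeholders, and the field's analyzer settings are left out because the thread does not show them. Changing `index_options` on an existing field generally means reindexing.

```python
import json

FIELD = "sha256"   # field name mentioned in the thread
DOC_TYPE = "file"  # hypothetical mapping type name


def build_mapping():
    # Store offsets in the postings so the postings highlighter can use them.
    # Analyzer settings are omitted here; keep whatever the field already uses.
    return {
        "mappings": {
            DOC_TYPE: {
                "properties": {
                    FIELD: {"type": "string", "index_options": "offsets"}
                }
            }
        }
    }


def build_search(digest):
    # Ask for the postings highlighter explicitly on the digest field.
    return {
        "query": {"match": {FIELD: digest}},
        "highlight": {"fields": {FIELD: {"type": "postings"}}},
    }


mapping = build_mapping()
search = build_search("0123456789abcdef" * 4)  # placeholder 64-char digest
print(json.dumps(mapping, indent=2))
print(json.dumps(search, indent=2))
```

The first body would go into the index creation request and the second into the search request; the index name and how the requests are sent are up to your client.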