Confusing highlight result when creating many tokens


(Jon-Paul Lussier) #1

Hey Elasticsearch, hopefully someone can at least explain whether this is
intentional and how it happens (I have had other fragment highlighting
issues not unlike this).

The problem seems simple: I have a 64-character string that I generate 62
tokens for. Whenever I search for the entire string, the highlight is
applied to the 50th fragment instead of the one that most nearly matches
my search query.

Also confusing: if I try a very similar search using an exact match on the
SHA1 or MD5 attributes, highlighting works as I'd expect.

Please see the gist here:
https://gist.github.com/jonpaul/d4a9aa7f9c8741933cf5

Currently I'm using 1.0.0-BETA2, so this may be a fixed bug. Sorry if
that's the case; I couldn't find anything that matches my problem per se.

Thanks very much in advance for help anyone can provide!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/6ed73d7d-fef8-4052-92a1-df2779795519%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Jon-Paul Lussier) #2

Hi Elasticsearch, I'm still waiting to see whether this is a known issue,
possibly one that's resolved in a future release, or whether this is
something I did. I'd appreciate knowing, at least, if anyone can help.
Thanks much.


(Jon-Paul Lussier) #3

I can confirm this issue is reproducible in the 1.0.1 release.


(Nik Everett) #4

Your confusing query is actually expanded into the following query:

filtered(((md5:ba0 md5:ba0d md5:ba0d7 md5:ba0d72 md5:ba0d722 md5:ba0d722f
md5:ba0d722f0 md5:ba0d722f04 md5:ba0d722f049 md5:ba0d722f0493
md5:ba0d722f0493f md5:ba0d722f0493f9 md5:ba0d722f0493f98
md5:ba0d722f0493f986 md5:ba0d722f0493f986e md5:ba0d722f0493f986e8
md5:ba0d722f0493f986e8d md5:ba0d722f0493f986e8d3 md5:ba0d722f0493f986e8d3f
md5:ba0d722f0493f986e8d3f0 md5:ba0d722f0493f986e8d3f01
md5:ba0d722f0493f986e8d3f012 md5:ba0d722f0493f986e8d3f0129
md5:ba0d722f0493f986e8d3f0129f md5:ba0d722f0493f986e8d3f0129fa
md5:ba0d722f0493f986e8d3f0129fae md5:ba0d722f0493f986e8d3f0129fae0
md5:ba0d722f0493f986e8d3f0129fae09 md5:ba0d722f0493f986e8d3f0129fae091
md5:ba0d722f0493f986e8d3f0129fae0919 md5:ba0d722f0493f986e8d3f0129fae0919a
md5:ba0d722f0493f986e8d3f0129fae0919af
md5:ba0d722f0493f986e8d3f0129fae0919af0
md5:ba0d722f0493f986e8d3f0129fae0919af05
md5:ba0d722f0493f986e8d3f0129fae0919af054
md5:ba0d722f0493f986e8d3f0129fae0919af054f
md5:ba0d722f0493f986e8d3f0129fae0919af054f7
md5:ba0d722f0493f986e8d3f0129fae0919af054f71
md5:ba0d722f0493f986e8d3f0129fae0919af054f710
md5:ba0d722f0493f986e8d3f0129fae0919af054f7103
md5:ba0d722f0493f986e8d3f0129fae0919af054f71036
md5:ba0d722f0493f986e8d3f0129fae0919af054f710360
md5:ba0d722f0493f986e8d3f0129fae0919af054f7103608
md5:ba0d722f0493f986e8d3f0129fae0919af054f71036089
md5:ba0d722f0493f986e8d3f0129fae0919af054f71036089f
md5:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd
md5:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd8
md5:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81
md5:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81b
md5:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bc
md5:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd
md5:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd1
md5:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd13
md5:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138
md5:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138d
md5:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138d6
md5:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138d63
md5:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138d639
md5:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138d639e
md5:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138d639e0
md5:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138d639e09
md5:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138d639e09d)
(sha1:ba0 sha1:ba0d sha1:ba0d7 sha1:ba0d72 sha1:ba0d722 sha1:ba0d722f
sha1:ba0d722f0 sha1:ba0d722f04 sha1:ba0d722f049 sha1:ba0d722f0493
sha1:ba0d722f0493f sha1:ba0d722f0493f9 sha1:ba0d722f0493f98
sha1:ba0d722f0493f986 sha1:ba0d722f0493f986e sha1:ba0d722f0493f986e8
sha1:ba0d722f0493f986e8d sha1:ba0d722f0493f986e8d3
sha1:ba0d722f0493f986e8d3f sha1:ba0d722f0493f986e8d3f0
sha1:ba0d722f0493f986e8d3f01 sha1:ba0d722f0493f986e8d3f012
sha1:ba0d722f0493f986e8d3f0129 sha1:ba0d722f0493f986e8d3f0129f
sha1:ba0d722f0493f986e8d3f0129fa sha1:ba0d722f0493f986e8d3f0129fae
sha1:ba0d722f0493f986e8d3f0129fae0 sha1:ba0d722f0493f986e8d3f0129fae09
sha1:ba0d722f0493f986e8d3f0129fae091 sha1:ba0d722f0493f986e8d3f0129fae0919
sha1:ba0d722f0493f986e8d3f0129fae0919a
sha1:ba0d722f0493f986e8d3f0129fae0919af
sha1:ba0d722f0493f986e8d3f0129fae0919af0
sha1:ba0d722f0493f986e8d3f0129fae0919af05
sha1:ba0d722f0493f986e8d3f0129fae0919af054
sha1:ba0d722f0493f986e8d3f0129fae0919af054f
sha1:ba0d722f0493f986e8d3f0129fae0919af054f7
sha1:ba0d722f0493f986e8d3f0129fae0919af054f71
sha1:ba0d722f0493f986e8d3f0129fae0919af054f710
sha1:ba0d722f0493f986e8d3f0129fae0919af054f7103
sha1:ba0d722f0493f986e8d3f0129fae0919af054f71036
sha1:ba0d722f0493f986e8d3f0129fae0919af054f710360
sha1:ba0d722f0493f986e8d3f0129fae0919af054f7103608
sha1:ba0d722f0493f986e8d3f0129fae0919af054f71036089
sha1:ba0d722f0493f986e8d3f0129fae0919af054f71036089f
sha1:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd
sha1:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd8
sha1:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81
sha1:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81b
sha1:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bc
sha1:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd
sha1:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd1
sha1:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd13
sha1:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138
sha1:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138d
sha1:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138d6
sha1:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138d63
sha1:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138d639
sha1:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138d639e
sha1:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138d639e0
sha1:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138d639e09
sha1:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138d639e09d)
(sha256:ba0 sha256:ba0d sha256:ba0d7 sha256:ba0d72 sha256:ba0d722
sha256:ba0d722f sha256:ba0d722f0 sha256:ba0d722f04 sha256:ba0d722f049
sha256:ba0d722f0493 sha256:ba0d722f0493f sha256:ba0d722f0493f9
sha256:ba0d722f0493f98 sha256:ba0d722f0493f986 sha256:ba0d722f0493f986e
sha256:ba0d722f0493f986e8 sha256:ba0d722f0493f986e8d
sha256:ba0d722f0493f986e8d3 sha256:ba0d722f0493f986e8d3f
sha256:ba0d722f0493f986e8d3f0 sha256:ba0d722f0493f986e8d3f01
sha256:ba0d722f0493f986e8d3f012 sha256:ba0d722f0493f986e8d3f0129
sha256:ba0d722f0493f986e8d3f0129f sha256:ba0d722f0493f986e8d3f0129fa
sha256:ba0d722f0493f986e8d3f0129fae sha256:ba0d722f0493f986e8d3f0129fae0
sha256:ba0d722f0493f986e8d3f0129fae09
sha256:ba0d722f0493f986e8d3f0129fae091
sha256:ba0d722f0493f986e8d3f0129fae0919
sha256:ba0d722f0493f986e8d3f0129fae0919a
sha256:ba0d722f0493f986e8d3f0129fae0919af
sha256:ba0d722f0493f986e8d3f0129fae0919af0
sha256:ba0d722f0493f986e8d3f0129fae0919af05
sha256:ba0d722f0493f986e8d3f0129fae0919af054
sha256:ba0d722f0493f986e8d3f0129fae0919af054f
sha256:ba0d722f0493f986e8d3f0129fae0919af054f7
sha256:ba0d722f0493f986e8d3f0129fae0919af054f71
sha256:ba0d722f0493f986e8d3f0129fae0919af054f710
sha256:ba0d722f0493f986e8d3f0129fae0919af054f7103
sha256:ba0d722f0493f986e8d3f0129fae0919af054f71036
sha256:ba0d722f0493f986e8d3f0129fae0919af054f710360
sha256:ba0d722f0493f986e8d3f0129fae0919af054f7103608
sha256:ba0d722f0493f986e8d3f0129fae0919af054f71036089
sha256:ba0d722f0493f986e8d3f0129fae0919af054f71036089f
sha256:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd
sha256:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd8
sha256:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81
sha256:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81b
sha256:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bc
sha256:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd
sha256:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd1
sha256:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd13
sha256:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138
sha256:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138d
sha256:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138d6
sha256:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138d63
sha256:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138d639
sha256:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138d639e
sha256:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138d639e0
sha256:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138d639e09
sha256:ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138d639e09d)))->cache(_type:sample)

This causes one match per character. The plain highlighter will combine
these matches because they overlap (b, ba, ba0, ba0d, etc.), but only the
first 50 of them, so your match is cut off at around 50 characters. Why 50?
Because:
private static final int MAX_NUM_TOKENS_PER_GROUP = 50;
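
To make the cutoff concrete, here is a rough Python simulation of the behaviour described above. It is only a sketch, not Lucene's actual code: the edge_ngrams and first_group_end helpers are made up for illustration, and the 3-character minimum gram is an assumption read off the first term (ba0) in the expanded query. It merges overlapping tokens into one group but stops extending the group once it holds 50 tokens.

```python
MAX_NUM_TOKENS_PER_GROUP = 50  # the constant quoted above


def edge_ngrams(text, min_gram=3):
    """Return (start, end) offsets of every prefix of text, shortest first.

    min_gram=3 is an assumption based on the shortest term in the query dump.
    """
    return [(0, n) for n in range(min_gram, len(text) + 1)]


def first_group_end(tokens, cap=MAX_NUM_TOKENS_PER_GROUP):
    """End offset of the first merged group of overlapping tokens.

    Simplified model: overlapping tokens extend the group until it holds
    `cap` tokens; further overlapping tokens are ignored.
    """
    _, group_end = tokens[0]
    count = 1
    for start, end in tokens[1:]:
        if start >= group_end:
            break  # token does not overlap: it would start a new group
        if count < cap:
            group_end = max(group_end, end)
            count += 1
    return group_end


sha256 = "ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138d639e09d"
tokens = edge_ngrams(sha256)
print(len(sha256), len(tokens), first_group_end(tokens))
```

Under these assumptions the 64-character hash yields 62 tokens, and the merged group stops growing at the 50th token, which is 52 characters long, so the tail of the string is left unhighlighted.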

None of the highlighters do a great job with this kind of match, but the
postings highlighter at least highlights everything:
<strong class="highlight">ba0<strong
class="highlight">d<strong class="highlight">7<strong
class="highlight">2<strong class="highlight">2<strong
class="highlight">f<strong class="highlight">0<strong
class="highlight">4<strong class="highlight">9<strong
class="highlight">3<strong class="highlight">f<strong
class="highlight">9<strong class="highlight">8<strong
class="highlight">6<strong class="highlight">e<strong
class="highlight">8<strong class="highlight">d<strong
class="highlight">3<strong class="highlight">f<strong
class="highlight">0<strong class="highlight">1<strong
class="highlight">2<strong class="highlight">9<strong
class="highlight">f<strong class="highlight">a<strong
class="highlight">e<strong class="highlight">0<strong
class="highlight">9<strong class="highlight">1<strong
class="highlight">9<strong class="highlight">a<strong
class="highlight">f<strong class="highlight">0<strong
class="highlight">5<strong class="highlight">4<strong
class="highlight">f<strong class="highlight">7<strong
class="highlight">1<strong class="highlight">0<strong
class="highlight">3<strong class="highlight">6<strong
class="highlight">0<strong class="highlight">8<strong
class="highlight">9<strong class="highlight">f<strong
class="highlight">d<strong class="highlight">8<strong
class="highlight">1<strong class="highlight">b<strong
class="highlight">c<strong class="highlight">d<strong
class="highlight">1<strong class="highlight">3<strong
class="highlight">8<strong class="highlight">d<strong
class="highlight">6<strong class="highlight">3<strong
class="highlight">9<strong class="highlight">e<strong
class="highlight">0<strong class="highlight">9<strong
class="highlight">d

So, your options:

  1. Live with it.
  2. Turn on the postings highlighter by indexing the sha256 field with
    index_options set to offsets.
  3. Wait until something else comes along.
  4. Maybe an application level hack based on the string lengths.
  5. Something else?
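
For option 2, the request bodies might look roughly like the following, written here as Python dicts and dumped to JSON. Treat this as a sketch in 1.x-era syntax: the sample type and sha256 field names come from the query above, the analyzer settings from the gist are left out, and the match query is only a stand-in for whatever query you actually run.

```python
import json

# Mapping fragment: store offsets in the postings list so the
# postings highlighter can be used on the sha256 field.
mapping = {
    "sample": {
        "properties": {
            "sha256": {
                "type": "string",
                "index_options": "offsets",
            }
        }
    }
}

# Search body: ask for the postings highlighter explicitly on that field.
search_body = {
    "query": {
        "match": {
            "sha256": "ba0d722f0493f986e8d3f0129fae0919af054f71036089fd81bcd138d639e09d"
        }
    },
    "highlight": {
        "fields": {
            "sha256": {"type": "postings"}
        }
    },
}

print(json.dumps(mapping, indent=2))
print(json.dumps(search_body, indent=2))
```

Changing index_options on an existing field means reindexing, so this is easiest to try on a fresh test index first.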

Nik

