Highlighter trims a field, why?


(Thomas Decaux) #1

I use the default mapping:

PUT /tom/test/1
{
   "title" : "tom",
   "url" : "http://www.fetedelascience.fr/",
   "url2" : "http://www.fetedelascience.fr/3"
}

POST /tom/_search
{
   "query": {
      "match": {
         "title": "tom"
      }
   },
   "highlight": {
      "no_match_size": 155,
      "fields": {
         "url": {},
         "url2": {}
      }
   }
}

Gives me:

{
    "_index": "tom",
    "_type": "test",
    "_id": "1",
    "_score": 0.30685282,
    "_source": {
       "title": "tom",
       "url": "http://www.fetedelascience.fr/",
       "url2": "http://www.fetedelascience.fr/3"
    },
    "highlight": {
       "url2": [
          "http://www.fetedelascience.fr/3"
       ],
       "url": [
          "http://www.fetedelascience.fr"
       ]
    }
 }

Why has the last slash of the URL disappeared?


(Nik Everett) #2

Weird. It's just how the no_match segmenter works in the plain highlighter: it grabs text ending at the last token before the end of the text, so anything after that token (here, the trailing slash) is dropped. I wrote this many years ago to simulate how the plain highlighter does segmentation when it finds hits, but it looks like it's wrong. This is a bug but I don't think it'll be too high on my priority list, sadly:
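
To make the behavior described above concrete, here is a minimal Python sketch of cutting a no-match fragment at the end offset of the last token. It is not the Elasticsearch code: the function name `no_match_fragment` is hypothetical, and the regex tokenizer is only a simplified stand-in for the analyzer, assumed to emit no token for the trailing "/".

```python
import re

# Assumption: runs of word characters stand in for analyzer tokens;
# punctuation such as "/" produces no token.
TOKEN = re.compile(r"\w+")


def no_match_fragment(text: str, no_match_size: int) -> str:
    """Cut the text at the end offset of the last token that fits
    within no_match_size characters (hypothetical helper)."""
    end = 0
    for match in TOKEN.finditer(text):
        if match.end() > no_match_size:
            break  # this token would overrun the requested size
        end = match.end()
    return text[:end]


if __name__ == "__main__":
    print(no_match_fragment("http://www.fetedelascience.fr/", 155))
    print(no_match_fragment("http://www.fetedelascience.fr/3", 155))
```

Under these assumptions the first URL loses its final slash because no token covers it, while the second keeps everything because "3" is the last token and ends exactly at the end of the text.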


(Thomas Decaux) #3

No problem, happy to find the reason at least!

Is this something written in ES rather than in Lucene, here: https://github.com/elastic/elasticsearch/tree/master/core/src/main/java/org/elasticsearch/search/highlight ?

Do you think using the fast_ or posting_ highlighter could fix it?

Thank you,


(Nik Everett) #4

Yeah, I'm aware. To varying degrees they are able to delegate down to the Lucene bits.

They all implement the process differently. You should try the fvh, it's more likely to work. The postings highlighter isn't going to do what you want unless you feed it complete sentences.
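
As a rough sketch of what trying the fvh involves, the Python snippet below builds the two request bodies: a mapping that stores term vectors with positions and offsets (which the fvh needs, hence the re-indexing) and the same search with the highlighter type set to `fvh`. The `string` field type assumes an Elasticsearch version of that era; newer versions use `text`. Whether the fvh actually keeps the trailing slash is not verified here.

```python
import json

# Assumption: pre-5.x style mapping ("string" type, "test" document type).
# The fvh requires term vectors with positions and offsets on each field.
field = {"type": "string", "term_vector": "with_positions_offsets"}
mapping = {
    "mappings": {
        "test": {
            "properties": {"url": dict(field), "url2": dict(field)}
        }
    }
}

# Same query as in the original post, with the highlighter type set to fvh.
search = {
    "query": {"match": {"title": "tom"}},
    "highlight": {
        "no_match_size": 155,
        "fields": {"url": {"type": "fvh"}, "url2": {"type": "fvh"}},
    },
}

if __name__ == "__main__":
    print("PUT /tom")
    print(json.dumps(mapping, indent=3))
    print("POST /tom/_search")
    print(json.dumps(search, indent=3))
```

The mapping has to be in place before the documents are indexed, since term vectors are computed at index time.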


(Thomas Decaux) #5

Just curious, why not rely on the Lucene highlighter?

I will test the fvh tomorrow, after re-indexing my data.

Thank you


(Nik Everett) #6

Lucene doesn't have support for no_match_size. Most of the code Elasticsearch has for highlighting is really just there to adapt the API onto Lucene's highlighters. no_match_size is kind of an anomaly in that it's trying to implement something without upstreaming it. And I'm not 100% sure why I didn't upstream the change at the time.

