Over the years a number of people have asked various questions in an effort to build an external highlighter (i.e., one that will highlight original media like PDFs using offset information returned from the query string). Nobody seems to have a good solution. I'm trying myself, but I'm getting confused about what "offsets" actually are in Elasticsearch.
Part 1: I just realized that if you use a folding filter like asciifolding, it changes the offsets reported by the _analyze endpoint. For example, if you run "My œsophagus caused a débâcle" through the analyzer, the offsets you get back appear to reflect the UTF-8 offsets of the text AFTER the asciifolding filter has been applied. Can anybody confirm that this is correct? This is really annoying because originally, I thought using the _analyze endpoint would be a nice way to figure out the offsets of tokens in my original UTF-8 byte stream. Obviously that's not going to work if the reported offsets are unrelated to what I fed Elasticsearch.
Part 2: This Stack Overflow post suggests that the TermVector results are in UTF-16 offsets anyway. Is that correct? I'm surprised to learn that analysis and search have different concepts of what an offset is.
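To make the distinction concrete, here is a minimal Python sketch of the conversion I would need if the reported offsets do turn out to be UTF-16 code units and I want positions in my original UTF-8 byte stream. It does not talk to Elasticsearch at all; the function name utf16_to_utf8_offset and the SAMPLE constant are my own, and the offsets passed in are hand-counted from the example sentence, not output from _analyze or term vectors.

```python
# Sketch only: maps an offset counted in UTF-16 code units onto a UTF-8 byte
# offset in the same text. Assumes the text is exactly what was indexed
# (same normalization form, no folding applied).

# The example sentence, with the non-ASCII characters written as escapes so
# they are guaranteed to be precomposed code points.
SAMPLE = "My \u0153sophagus caused a d\u00e9b\u00e2cle"


def utf16_to_utf8_offset(text: str, utf16_offset: int) -> int:
    """Return the UTF-8 byte offset matching a UTF-16 code-unit offset."""
    units = 0      # UTF-16 code units consumed so far
    byte_pos = 0   # UTF-8 bytes consumed so far
    for ch in text:
        if units >= utf16_offset:
            break
        # Code points above U+FFFF take two UTF-16 code units (a surrogate pair).
        units += 2 if ord(ch) > 0xFFFF else 1
        byte_pos += len(ch.encode("utf-8"))
    if units != utf16_offset:
        # Offset is past the end of the text or splits a surrogate pair.
        raise ValueError(f"offset {utf16_offset} is not a character boundary")
    return byte_pos


# "débâcle" occupies UTF-16 code units [22, 29) in SAMPLE (hand-counted).
start = utf16_to_utf8_offset(SAMPLE, 22)
end = utf16_to_utf8_offset(SAMPLE, 29)
print(start, end, SAMPLE.encode("utf-8")[start:end].decode("utf-8"))
```

In this sentence every character fits in a single UTF-16 code unit, so the UTF-16 offsets equal plain character indexes, but "œ", "é" and "â" each take two bytes in UTF-8, which is why the byte offsets drift away from the code-unit offsets as you move through the string.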
Any illumination appreciated.