Mapping offsets to matching tokens

Justin_Lee · August 23, 2016, 7:48pm

Over the years a number of people have asked various questions in an effort to build an external highlighter (i.e., one that will highlight original media like PDFs using offset information returned from the query string). Nobody seems to have a good solution. I'm trying myself, but I'm getting confused about what "offsets" actually are in Elasticsearch.

Part 1: I just realized that if you use a folding filter like asciifolding, it changes the offsets reported by the _analyze endpoint. For example, if you run "My œsophagus caused a débâcle" through the analyzer, the offsets you get back appear to reflect the UTF-8 offsets of the text AFTER the asciifolding filter has been applied. Can anybody confirm that this is correct? This is really annoying because I originally thought the _analyze endpoint would be a nice way to figure out the offsets of tokens in my original UTF-8 byte stream. Obviously that's not going to work if the reported offsets are unrelated to what I fed Elasticsearch.

Part 2: This Stack Overflow post suggests that the TermVector results are in UTF-16 offsets anyway. Is that correct? I'm surprised to learn that analysis and search have different concepts of what an offset is.

Any illumination appreciated.

Justin_Lee · August 23, 2016, 9:21pm

Sorry, I must have confused myself. Offsets are UTF-16 in both cases and they seem to be reported correctly regardless of analysis chain. I'm not sure how I got it in my head that the offsets were UTF-8.
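
For anyone who lands here with the same goal: if the reported offsets count UTF-16 code units, they have to be converted before they can index into a UTF-8 byte stream. Below is a minimal Python sketch of that conversion. The function name utf16_offset_to_utf8 is made up for illustration, the sample offsets are simply what counting UTF-16 code units gives for the example sentence (not output copied from Elasticsearch), and the accented letters are assumed to be precomposed characters.

```python
def utf16_offset_to_utf8(text: str, utf16_offset: int) -> int:
    """Convert an offset counted in UTF-16 code units into a UTF-8 byte offset.

    Hypothetical helper, not part of any Elasticsearch client.
    An offset that falls inside a surrogate pair is rounded up to the
    end of that character.
    """
    units = 0   # UTF-16 code units consumed so far
    nbytes = 0  # UTF-8 bytes consumed so far
    for ch in text:
        if units >= utf16_offset:
            break
        # Characters outside the Basic Multilingual Plane take two
        # UTF-16 code units (a surrogate pair); all others take one.
        units += 2 if ord(ch) > 0xFFFF else 1
        nbytes += len(ch.encode("utf-8"))
    return nbytes


# "My œsophagus caused a débâcle", written with escapes so the
# accented letters are unambiguously single precomposed code points.
text = "My \u0153sophagus caused a d\u00e9b\u00e2cle"
raw = text.encode("utf-8")

# Token "œsophagus": UTF-16 offsets 3..12 (assumed, by counting code units).
start = utf16_offset_to_utf8(text, 3)
end = utf16_offset_to_utf8(text, 12)
token_bytes = raw[start:end]

# Token "débâcle": UTF-16 offsets 22..29 (assumed, by counting code units).
last_start = utf16_offset_to_utf8(text, 22)
last_end = utf16_offset_to_utf8(text, 29)
```

For text that stays within the Basic Multilingual Plane, UTF-16 offsets happen to equal Python str indices, so text[3:12] would also work here. The two diverge once the text contains characters such as emoji, which is why the helper counts code units explicitly.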
