Emoji unicode characters and term vector offsets in elasticsearch-py in Python 3.4

Patrick_Lam · September 25, 2015, 7:27am

Hi,

I'm using Python 3.4 with the elasticsearch-py client. When my text fields include emojis, the term vector offsets no longer line up with Python string indices. My understanding is that this has something to do with how certain Python clients encode Unicode characters.

For example,

>>> es.indices.analyze(body='\U0001f64f testing')
{'tokens': [{'end_offset': 10,
   'position': 1,
   'start_offset': 3,
   'token': 'testing',
   'type': '<ALPHANUM>'}]}

>>> '\U0001f64f testing'[3:10]
'esting'

I see this mismatch in Python 3.4 and Python 2.7.9, but not in 2.7.10. Is there a way to work around it in Python 3.4? I want to be able to map the start and end offsets from the retrieved term vectors onto string indices in Python.
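
One possible workaround, offered as a sketch rather than a confirmed fix: the numbers above are consistent with the offsets being counted in UTF-16 code units (the emoji takes two units, so 'testing' starts at 3), while a Python 3.4 str is indexed by code point (where it starts at 2). Assuming that is what is going on, the offsets can be translated before slicing. The helper names below are made up for illustration and are not part of elasticsearch-py.

```python
def utf16_offset_to_index(text, utf16_offset):
    """Translate an offset counted in UTF-16 code units into a str index.

    Assumption: the offsets coming back from Elasticsearch count UTF-16
    code units, so any character above U+FFFF counts as two units.
    """
    units = 0
    for index, char in enumerate(text):
        if units >= utf16_offset:
            return index
        # Characters outside the Basic Multilingual Plane need a surrogate pair.
        units += 2 if ord(char) > 0xFFFF else 1
    return len(text)


def slice_by_utf16_offsets(text, start_offset, end_offset):
    """Return the substring covered by a pair of UTF-16 based offsets."""
    start = utf16_offset_to_index(text, start_offset)
    end = utf16_offset_to_index(text, end_offset)
    return text[start:end]


def slice_via_encoding(text, start_offset, end_offset):
    """Same idea, done by slicing the UTF-16 encoded bytes (2 bytes per unit)."""
    encoded = text.encode('utf-16-le')
    return encoded[2 * start_offset:2 * end_offset].decode('utf-16-le')


text = '\U0001f64f testing'
print(slice_by_utf16_offsets(text, 3, 10))  # testing
print(slice_via_encoding(text, 3, 10))      # testing
```

The first helper walks the string once per offset, which is fine for short fields. For long documents with many tokens, it would probably be better to build the offset table once and reuse it.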
