Emoji unicode characters and term vector offsets in elasticsearch-py in Python 3.4


(Patrick Lam) #1

Hi,

I'm using Python 3.4 with the elasticsearch-py client. When my text fields include emojis, the term vector offsets no longer line up with the text. My understanding is that this has something to do with how certain Python clients encode Unicode characters.

For example,

>>> es.indices.analyze(body='\U0001f64f testing')
{'tokens': [{'end_offset': 10,
             'position': 1,
             'start_offset': 3,
             'token': 'testing',
             'type': '<ALPHANUM>'}]}

>>> '\U0001f64f testing'[3:10]
'esting'

I see this mismatch in Python 3.4 and Python 2.7.9, but not in 2.7.10. Is there a way to get around this in Python 3.4? I want to be able to map the start and end offsets from the retrieved term vectors onto string indices in Python.
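
For reference, here is a minimal sketch of the kind of conversion I have in mind. It assumes the offsets are counted in UTF-16 code units, which would explain why the emoji (a character outside the Basic Multilingual Plane) counts as two positions on the Elasticsearch side but only one in a Python 3.4 string. The helper name `utf16_offset_to_index` is my own and not part of elasticsearch-py.

```python
def utf16_offset_to_index(text, utf16_offset):
    """Map an offset counted in UTF-16 code units to a Python 3 str index.

    Assumption: the offsets come from a system that counts UTF-16 code
    units, and `text` is a str indexed by code point (Python 3.3+).
    """
    units = 0
    for index, char in enumerate(text):
        if units >= utf16_offset:
            return index
        # Characters above U+FFFF need a surrogate pair: two code units.
        units += 2 if ord(char) > 0xFFFF else 1
    # Offset at or past the end of the text.
    return len(text)


text = '\U0001f64f testing'
start = utf16_offset_to_index(text, 3)    # start_offset from the analyze output
end = utf16_offset_to_index(text, 10)     # end_offset from the analyze output
print(text[start:end])
```

With the offsets above this slices out 'testing' rather than 'esting'. The loop is linear in the length of the text for every lookup, so for many tokens per document it would probably be better to build the offset-to-index mapping once and reuse it.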


(system) #2