According to the docs, the Word Delimiter Token Filter (WDTF) should, by default, split on any non-alphanumeric character.
However, I've found that the WDTF does not split when it encounters an emoji outside the Basic Multilingual Plane, i.e. one encoded as two UTF-16 code units (a surrogate pair), such as 😂. I'd consider an emoji to be a non-alphanumeric character. For single-code-unit (BMP) emoji and symbols, the WDTF acts exactly as I would expect and splits the token.
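To be precise about "one character" vs "two characters": I'm counting UTF-16 code units, which is what Java's String.length() returns; the UTF-8 encoding of a supplementary code point is four bytes. A minimal Java sketch showing the difference for the two symbols used in the examples below:

public class EmojiLength {
    public static void main(String[] args) {
        String tears = "\uD83D\uDE02"; // 😂 U+1F602, UTF-8 0xF0 0x9F 0x98 0x82
        String heart = "\u2764";       // ❤  U+2764,  UTF-8 0xE2 0x9D 0xA4
        System.out.println(tears.length()); // 2 -> a surrogate pair
        System.out.println(heart.length()); // 1 -> a single BMP code unit
        System.out.println(Character.isHighSurrogate(tears.charAt(0))); // true
    }
}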
For example, using 😂 (U+1F602, UTF-8 0xF0 0x9F 0x98 0x82):
GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter" : ["word_delimiter"],
  "text" : "two character emo😂ji"
}
gives the output:
{
  "tokens": [
    {
      "token": "two",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "character",
      "start_offset": 4,
      "end_offset": 13,
      "type": "word",
      "position": 1
    },
    {
      "token": "emo😂ji",
      "start_offset": 14,
      "end_offset": 21,
      "type": "word",
      "position": 2
    }
  ]
}
Whereas using ❤ (U+2764, UTF-8 0xE2 0x9D 0xA4), here followed by the variation selector U+FE0F:
GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter" : ["word_delimiter"],
  "text" : "one character emo❤️ji"
}
gives me the result I would expect, a split on the heart (note that the "️ji" token begins with the invisible U+FE0F that followed the heart):
{
  "tokens": [
    {
      "token": "one",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "character",
      "start_offset": 4,
      "end_offset": 13,
      "type": "word",
      "position": 1
    },
    {
      "token": "emo",
      "start_offset": 14,
      "end_offset": 17,
      "type": "word",
      "position": 2
    },
    {
      "token": "️ji",
      "start_offset": 18,
      "end_offset": 21,
      "type": "word",
      "position": 3
    }
  ]
}
Is there a reason why the code units of a surrogate-pair emoji would be considered 'alphanumeric' for the purposes of the WDTF? As emoji become more and more common in text, this can produce really inconsistent results from ES.
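In case it helps narrow things down, my working assumption (not verified against the Lucene source) is that the filter classifies each UTF-16 code unit individually, via something like Character.getType(char), rather than classifying full code points. Under that assumption the halves of 😂 fall into the Unicode SURROGATE category rather than a symbol category, and leaving surrogates alone may be deliberate, since splitting between them would corrupt the pair. A sketch dumping the per-code-unit categories of the two test strings:

public class CodeUnitTypes {
    public static void main(String[] args) {
        // Dump the Unicode general category of every UTF-16 code unit
        // in the two test strings from the examples above.
        String text = "emo\uD83D\uDE02ji emo\u2764\uFE0Fji";
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            System.out.printf("U+%04X type=%2d surrogate=%b%n",
                    (int) c, Character.getType(c), Character.isSurrogate(c));
        }
        // U+D83D and U+DE02 -> type 19 (Character.SURROGATE)
        // U+2764            -> type 28 (Character.OTHER_SYMBOL)
        // U+FE0F            -> type  6 (Character.NON_SPACING_MARK), which may
        //                      be why it stays attached to the "ji" token above
    }
}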