According to the docs, the Word Delimiter Token Filter (WDTF) should, by default, split on any non-alphanumeric character.
However, I've found that the WDTF does not split when it encounters an emoji outside the Basic Multilingual Plane, i.e. one encoded as two UTF-16 code units (a surrogate pair), such as 😂. I'd consider an emoji to be a non-alphanumeric character. For single-code-unit (BMP) emoji and symbols, the WDTF acts exactly as I would expect and splits the token.
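To be precise about "one character" vs "two characters": I'm counting UTF-16 code units, which is what Java's String.length() returns; the UTF-8 encoding of a supplementary code point is four bytes. A minimal Java sketch showing the difference for the two symbols used in the examples below:

public class EmojiLength {
    public static void main(String[] args) {
        String tears = "\uD83D\uDE02"; // 😂 U+1F602, UTF-8 0xF0 0x9F 0x98 0x82
        String heart = "\u2764";       // ❤  U+2764,  UTF-8 0xE2 0x9D 0xA4
        System.out.println(tears.length()); // 2 -> a surrogate pair
        System.out.println(heart.length()); // 1 -> a single BMP code unit
        System.out.println(Character.isHighSurrogate(tears.charAt(0))); // true
    }
}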
For example, using 😂 (U+1F602, UTF-8 0xF0 0x9F 0x98 0x82):
GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter" : ["word_delimiter"],
  "text" : "two character emo😂ji"
}
gives the output:
{
  "tokens": [
    {
      "token": "two",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "character",
      "start_offset": 4,
      "end_offset": 13,
      "type": "word",
      "position": 1
    },
    {
      "token": "emo😂ji",
      "start_offset": 14,
      "end_offset": 21,
      "type": "word",
      "position": 2
    }
  ]
}
Whereas using ❤ (U+2764, UTF-8 0xE2 0x9D 0xA4), here followed by the variation selector U+FE0F:
GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter" : ["word_delimiter"],
  "text" : "one character emo❤️ji"
}
gives me the result I would expect, a split on the heart (note that the "️ji" token begins with the invisible U+FE0F that followed the heart):
{
  "tokens": [
    {
      "token": "one",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "character",
      "start_offset": 4,
      "end_offset": 13,
      "type": "word",
      "position": 1
    },
    {
      "token": "emo",
      "start_offset": 14,
      "end_offset": 17,
      "type": "word",
      "position": 2
    },
    {
      "token": "️ji",
      "start_offset": 18,
      "end_offset": 21,
      "type": "word",
      "position": 3
    }
  ]
}
Is there a reason why the code units of a surrogate-pair emoji would be considered 'alphanumeric' for the purposes of the WDTF? As emoji become more and more common in text, this can produce really inconsistent results from ES.
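In case it helps narrow things down, my working assumption (not verified against the Lucene source) is that the filter classifies each UTF-16 code unit individually, via something like Character.getType(char), rather than classifying full code points. Under that assumption the halves of 😂 fall into the Unicode SURROGATE category rather than a symbol category, and leaving surrogates alone may be deliberate, since splitting between them would corrupt the pair. A sketch dumping the per-code-unit categories of the two test strings:

public class CodeUnitTypes {
    public static void main(String[] args) {
        // Dump the Unicode general category of every UTF-16 code unit
        // in the two test strings from the examples above.
        String text = "emo\uD83D\uDE02ji emo\u2764\uFE0Fji";
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            System.out.printf("U+%04X type=%2d surrogate=%b%n",
                    (int) c, Character.getType(c), Character.isSurrogate(c));
        }
        // U+D83D and U+DE02 -> type 19 (Character.SURROGATE)
        // U+2764            -> type 28 (Character.OTHER_SYMBOL)
        // U+FE0F            -> type  6 (Character.NON_SPACING_MARK), which may
        //                      be why it stays attached to the "ji" token above
    }
}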