I'm playing with the strangest bug ever in ES. As I haven't clearly identified the root cause, I'm trying the forum first.
My analysis does:
a char filter to replace <3 by ❤
a synonym token filter to extend ❤ to ❤ and love.
If my content is I ❤ ES, my tokens are ok and I have a "love" token.
But if my content is I <3 ES, I only find ❤ in the tokens, not "love"!
Isn't the char_filter supposed to be executed first?
Also note that it does not happen with other replacements. If I replace all "ES" by "Elasticsearch" in the char_filter, and then have synonyms for "Elasticsearch", they will be indexed.
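For reference, here is a minimal sketch of the kind of index settings described above, written as a Python dict and dumped to JSON. The names heart_char_filter, heart_synonyms and emoji_analyzer are made up for illustration, and the whitespace tokenizer is an assumption; the real configuration may differ.

```python
import json

# Hypothetical names throughout: "heart_char_filter", "heart_synonyms",
# "emoji_analyzer". Only the overall shape (mapping char filter followed by a
# synonym token filter) comes from the description above.
settings = {
    "analysis": {
        "char_filter": {
            "heart_char_filter": {
                "type": "mapping",
                "mappings": ["<3 => ❤"],  # rewrite the emoticon before tokenizing
            }
        },
        "filter": {
            "heart_synonyms": {
                "type": "synonym",
                "synonyms": ["❤, love"],  # expand the heart to ❤ and love
            }
        },
        "analyzer": {
            "emoji_analyzer": {
                "type": "custom",
                "char_filter": ["heart_char_filter"],
                "tokenizer": "whitespace",  # assumed tokenizer
                "filter": ["heart_synonyms"],
            }
        },
    }
}

# JSON body that could be sent when creating a test index.
body = json.dumps({"settings": settings}, ensure_ascii=False, indent=2)
print(body)
```

Running the analyzer on I ❤ ES and on I <3 ES and comparing the tokens is then enough to reproduce the difference.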
Let me try to understand: you want to combine a char_filter with a synonym filter?
This does not work, you are correct. I can only speculate about the reason. Maybe the char_filter has difficulty passing tokens to the synonym filter, or the synonym filter cannot pick up the tokens.
Why not use the synonym filter alone, without the char_filter, like this?
Glad you can see the issue too, I'm not crazy yet. Should I submit a ticket on GitHub?
Plus, I can't use your solution because I have a filter removing punctuation from tokens before the synonym filter, which lets me handle emoji trapped in quotes or attached to a punctuation sign (the whitespace tokenizer produces a ❤. token for the input Es is ❤., and the dot is an issue).
As <3 and emoticons in general are mostly punctuation, I would be left with no match in the synonym phase. That's why the char_filter was supposed to be handy.
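To make the ordering problem concrete, here is a small pure-Python simulation of that chain (character mapping, whitespace split, punctuation stripping, synonym expansion). It is not Elasticsearch, only a sketch: the function names are invented, and "punctuation" is assumed to mean ASCII punctuation as listed in Python's string.punctuation.

```python
import string

# Hypothetical synonym table: the heart expands to itself plus "love".
SYNONYMS = {"❤": ["❤", "love"]}


def strip_punctuation(token):
    """Drop ASCII punctuation characters from a token (assumed filter behavior)."""
    return "".join(ch for ch in token if ch not in string.punctuation)


def analyze(text, char_map=None):
    """Simulate: char filter -> whitespace tokenizer -> punctuation filter -> synonyms."""
    # 1. Char filter: plain string replacement applied before tokenizing.
    for source, target in (char_map or {}).items():
        text = text.replace(source, target)
    # 2. Whitespace tokenizer.
    tokens = text.split()
    # 3. Punctuation-removal filter; tokens that end up empty are dropped.
    tokens = [strip_punctuation(t) for t in tokens]
    tokens = [t for t in tokens if t]
    # 4. Synonym expansion.
    expanded = []
    for token in tokens:
        expanded.extend(SYNONYMS.get(token, [token]))
    return expanded


print(analyze("Es is ❤."))                  # trailing dot removed, heart expanded
print(analyze("I <3 ES"))                   # "<" is stripped, only "3" is left
print(analyze("I <3 ES", {"<3": "❤"}))      # char filter rescues the emoticon
```

Without the character mapping, <3 is reduced to 3 before the synonym step and never matches; with it, the heart survives and picks up love.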
I'm not sure if you just want a demo solution for <3, ❤, and 'love'. It feels like you want a general solution for all emoticons and what they mean, so they get translated into entities which can be searched.
If so, that is far more than just fiddling with char_filter and tokenizer.
Maybe you want emoticon segmentation and part-of-speech tagging. I know two solutions, but neither of them is turnkey-ready for Elasticsearch:
Alternative 1 is based on Java Pattern construction, while alternative 2 is more Lucene-focused and looks promising for wrapping into an Elasticsearch tokenizer.
I do. I'm building an emoji-capable search engine, and came across this <3 => ❤ bug when trying to deal with emoticons. I'm publishing my research and solutions pretty soon, and it works as expected except for <3, which is why I created this thread.
Seems to me there is a bug somewhere. I need to find some time to crack the case and maybe open an issue about it.
Thanks for the links to the POS tokenizer, that's awesome.
The <3 issue seems to have disappeared in my latest tests. It may be thanks to the new clean-up I'm doing (see the char_filters). Thanks again for the help you provided.