ES can't find ❤️ love ❤️


(Damien Alexandre) #1

I'm playing with the strangest bug ever in ES. As I haven't clearly identified the root cause, I'm trying the forum first.

My analysis does two things:

  • a char filter to replace <3 with ❤
  • a synonym token filter to expand ❤ to ❤ and love.
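
The two steps above can be sketched as index settings. This is only a minimal illustration, not the exact configuration from the linked play: the filter names `emoticons` and `emoji_synonyms` are made up, and the whitespace tokenizer is an assumption. It uses the standard `mapping` char filter and `synonym` token filter, and would be sent as the body when creating the index.

```json
{
  "settings": {
    "analysis": {
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": ["<3 => ❤"]
        }
      },
      "filter": {
        "emoji_synonyms": {
          "type": "synonym",
          "synonyms": ["❤ => ❤, love"]
        }
      },
      "analyzer": {
        "myanalyzer": {
          "type": "custom",
          "char_filter": ["emoticons"],
          "tokenizer": "whitespace",
          "filter": ["emoji_synonyms"]
        }
      }
    }
  }
}
```

With a chain like this, the char filter rewrites the raw text before tokenization, so the synonym filter should in principle see a ❤ token whether the input contained ❤ or <3.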

If my content is I ❤ ES, my tokens are ok and I have a "love" token.

But if my content is I <3 ES, I only find ❤ in the tokens, not "love"!

Isn't the char_filter supposed to be executed first?
Also note that it does not happen with other replacements. If I replace all "ES" by "Elasticsearch" in the char_filter, and then have synonyms for "Elasticsearch", they will be indexed.

Here is a "play" where I managed to reproduce the error: https://found.no/play/gist/2493ccf2f307d46359a0#analysis

You can see that where there is a <3, no "love" synonym appears, even though the replacement did occur.

Thanks for your help!


(Jörg Prante) #2

I can not reproduce this. In my version, there is no error.

All 3 searches give 3 hits.


(Damien Alexandre) #3

Thanks for looking into it!

Yes, you get the hit, but that's because <3 is translated to ❤.

If you search only for <3, you can't find documents with love:

POST /test/doc/_search
{
    "query" : {
        "match" : {
            "text" : "<3"
        }
    }
}
# Only returns <3 docs

That's because the synonyms are not applied to transformed emoticons :frowning:

You can see it, no "love" token here:

GET /test/_analyze?analyzer=myanalyzer
{
  "text": "I <3 ES"
} 

Tokens are "i", "❤" and "ES". Why no "love"? It makes no sense to me: ❤ is supposed to pass through the synonym token filter and be expanded to "love".

❤


(Jörg Prante) #4

I'm trying to understand: you want to combine a char_filter with a synonym filter?

This does not work, you are correct. I can only speculate about the reason. Maybe the char_filter has difficulty passing its output on to the synonym filter, or the synonym filter cannot pick up the tokens.

Why not use the synonym filter alone, without the char_filter, like this?


(Damien Alexandre) #5

Glad you can see the issue too, I'm not crazy yet :scream: Should I submit a ticket on GitHub?

Plus, I can't use your solution because I have a filter that removes punctuation from tokens before the synonym filter. This lets me handle emoji trapped in quotes or stuck to a punctuation sign (the whitespace tokenizer produces a ❤. token for the input Es is ❤., and the dot is an issue).

As <3 and emoticons in general are mostly punctuation, I would be left with no match in the synonym phase. That's why the char_filter was supposed to be handy :slightly_smiling:


(Jörg Prante) #6

I'm not sure you want just a demo solution for <3, ❤, and 'love'. It feels like you want a general solution for all emoticons and what they mean, so that they get translated into entities which can be searched.

If so, it is something that is far more than just fiddling with char_filter and tokenizer.

Maybe you want emoticon segmentation and part-of-speech tagging. I know two solutions, but neither of them is turnkey-ready for Elasticsearch:

  1. Tweet Natural Language Processing - Part-of-speech Tagger http://www.cs.cmu.edu/~ark/TweetNLP/#pos
  2. Twitter's text common package with a Lucene-like Tokenizer at https://github.com/twitter/commons/tree/master/src/java/com/twitter/common/text

Alternative 1 is based on Java Pattern construction, while alternative 2 is more Lucene-focused and looks like a promising candidate for wrapping in an Elasticsearch tokenizer.


(Damien Alexandre) #7

I do. I'm building an emoji-capable search engine, and came across this <3 => ❤ bug when trying to deal with emoticons. I'm publishing my research and solutions pretty soon, and it works as expected except for ❤, which is why I created this thread.
It seems to me there is a bug somewhere. I need to find some time to crack the case and maybe open an issue about it.

Thanks for the links to POS tokenizer, that's awesome :+1:


(Damien Alexandre) #8

Sorry for the double post,
I wanted to show you where I got with this whole emoji search situation.

I've published an article about my research: http://jolicode.com/blog/search-for-emoji-with-elasticsearch

The <3 issue seems to have disappeared in my latest tests; it may be thanks to the new clean-up I'm doing (see the char_filters). Thanks again for the help you provided me.

