Searching for Emoji Characters / Unicode


(Ben) #1

I have some articles in my ES index that contain emoji characters. I'd like to perform a search for articles that contain specific emojis. Via ElasticHQ I can see the emojis are in the data - they are rendered as icons in OS X, so I assume the Unicode data is stored correctly. However, when I run the query below I get no results. If I run a plain-text search I do get my results. From my very limited experience of ES (1 day) I'm guessing I need to add an analyzer that can handle this? I don't know where to start, or whether this is even the correct diagnosis.

Best,
Ben

    client.search({
        index: 'articles',
        body: {
            fields: ["code", "title"],
            query: {
                query_string: {
                    query: "😕"
                }
            }
        }
    }, function (error, response) {
        // handle error / response here ...
    });
    
    //results: {"took":6,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

#2

The default analyzer (StandardAnalyzer) uses the Unicode word break algorithm (http://unicode.org/reports/tr29/).

The properties assigned to emoji are just "other", so they are treated no differently from "trash" like ^ and so on. (Sorry, that's just my personal opinion of them, too.)

Anyway, yes, you will need a custom analyzer if you want to make sense of emoji. You will have to decide how to interpret them: for example, whether each emoji should be its own word, or what should happen if someone writes 87 smileys in a row, and so on.
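To see why the word break algorithm discards emoji, here is a quick sketch in Node (>= 10) using Unicode property escapes. The tokenizer keeps runs of Letter/Number characters; emoji carry neither property, which is why the query above returns zero hits:

```javascript
// Characters with a Letter (L) or Number (N) property survive
// word-break tokenization; everything else is dropped.
const isWordChar = (ch) => /[\p{L}\p{N}]/u.test(ch);

console.log(isWordChar("a"));   // true  -> kept as (part of) a token
console.log(isWordChar("7"));   // true  -> kept
console.log(isWordChar("^"));   // false -> dropped
console.log(isWordChar("😕")); // false -> dropped, hence zero hits
```

This is only an illustration of the classification, not what the StandardAnalyzer literally executes, but the outcome for emoji is the same: no token is produced, so there is nothing in the index to match.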


(Ben) #3

We work in the messenger-app space, so emoji are a character set we need to support. For now I'd be happy to be able to search for individual emoji characters, and maybe later add something more complex that understands the context of a series of emoji. Which analyzer would I use in the basic scenario, and where would I learn how to integrate it? Many thanks.


#4

There is no such analyzer. This area is hardly stable and not really standardized yet, so there are not yet well-accepted best practices. You may have to write custom code, unless you want to do something very simple like using a MappingCharFilter with mappings such as :dvd: -> DVD
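As a concrete sketch of the MappingCharFilter suggestion above: Elasticsearch's `mapping` char filter can rewrite selected emoji into searchable words before tokenization. The index settings below are an illustration only; the names (`emoji_map`, `emoji_text`) and the specific mappings are made up for this example:

```javascript
// Hypothetical index settings: rewrite a few known emoji to words
// before the standard tokenizer runs, so they become normal tokens.
const settings = {
  settings: {
    analysis: {
      char_filter: {
        emoji_map: {
          type: "mapping",
          mappings: [
            "😕 => confused",   // example mapping, chosen arbitrarily
            "📀 => dvd"
          ]
        }
      },
      analyzer: {
        emoji_text: {
          type: "custom",
          char_filter: ["emoji_map"],
          tokenizer: "standard",
          filter: ["lowercase"]
        }
      }
    }
  }
};

// Would be applied when creating the index, e.g.:
// client.indices.create({ index: 'articles', body: settings });
```

The obvious limitation is that only emoji you have explicitly listed become searchable; anything not in the mapping table is still dropped by the tokenizer.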

There are a lot of possibilities depending on the use case, but it's nowhere near the maturity level where we have incorporated anything into Lucene and are ready to support backwards compatibility for it, etc.

You can get some ideas from http://unicode.org/reports/tr51/#Searching and think about how you want it to work for your app.


(Ben) #5

Thanks so much for your help. Ben


(Damien Alexandre) #6

For my needs, I used a whitespace tokenizer with quite a lot of customizations. Did you come up with a good solution?
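For reference, a minimal version of the whitespace-tokenizer approach looks like the settings below. The whitespace tokenizer splits only on whitespace rather than on word-break properties, so emoji survive as (parts of) tokens. This is a sketch; the analyzer name (`emoji_whitespace`) is made up, and the "lot of customizations" mentioned above (e.g. token filters to split runs of emoji apart) are omitted:

```javascript
// Hypothetical index settings: keep emoji by tokenizing on
// whitespace only instead of Unicode word breaks.
const settings = {
  settings: {
    analysis: {
      analyzer: {
        emoji_whitespace: {
          type: "custom",
          tokenizer: "whitespace",
          filter: ["lowercase"]
        }
      }
    }
  }
};

// client.indices.create({ index: 'articles', body: settings });
```

One caveat: with this analyzer, "😕😕" or "hello😕" is a single token, so a search for a lone "😕" would not match it without further token filtering.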
