Searching for Emoji Characters / Unicode


(Ben) #1

I have some articles in my ES index that contain emoji characters. I'd like to perform a search for articles that contain specific emojis. Via ElasticHQ I can see the emojis are in the data - they are rendered as icons in OS X, so I assume the Unicode data is stored correctly. However, when I run the query below I get no results. If I run a plain-text search I do get my results. From my very limited experience of ES (1 day) I'm guessing I need to add an analyzer that can handle this? I don't know where to start, or whether this is even the correct diagnosis.

Best,
Ben

    client.search({
        index: 'articles',
        body: {
            fields: ["code", "title"],
            query: {
                query_string: {
                    query: "😕"
                }
            }
        }
    }, function (error, response) {
        // handle error / response here ...
    });
    
    //results: {"took":6,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

#2

The default analyzer (StandardAnalyzer) uses the Unicode word break algorithm (http://unicode.org/reports/tr29/).

The properties assigned to emoji are just "other", so they are treated no differently from "trash" like ^ and so on. (Sorry, that's just my personal opinion of them, too.)

Anyway, yes, you will need a custom analyzer if you want to make sense of emoji. You will have to decide how to interpret them: for example, whether each emoji should be its own word, or what should happen if someone writes 87 smileys in a row, and so on.
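To see why the word break algorithm discards emoji, here is a quick sketch in Node (>= 10) using Unicode property escapes. The tokenizer keeps runs of Letter/Number characters; emoji carry neither property, which is why the query above returns zero hits:

```javascript
// Characters with a Letter (L) or Number (N) property survive
// word-break tokenization; everything else is dropped.
const isWordChar = (ch) => /[\p{L}\p{N}]/u.test(ch);

console.log(isWordChar("a"));   // true  -> kept as (part of) a token
console.log(isWordChar("7"));   // true  -> kept
console.log(isWordChar("^"));   // false -> dropped
console.log(isWordChar("😕")); // false -> dropped, hence zero hits
```

This is only an illustration of the classification, not what the StandardAnalyzer literally executes, but the outcome for emoji is the same: no token is produced, so there is nothing in the index to match.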


(Ben) #3

We work in the messenger-app space, so emoji are a character set we need to support. For now I'd be happy to be able to search for individual emoji characters, and maybe later add something more complex that understands the context of a series of emoji. Which analyzer would I use in the basic scenario, and where would I learn how to integrate it? Many thanks.


#4

There is no such analyzer. This area is hardly stable and not really standardized yet, so there are not yet well-accepted best practices. You may have to write custom code, unless you want to do something very simple like using a MappingCharFilter with mappings such as :dvd: -> DVD
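As a concrete sketch of the MappingCharFilter suggestion above: Elasticsearch's `mapping` char filter can rewrite selected emoji into searchable words before tokenization. The index settings below are an illustration only; the names (`emoji_map`, `emoji_text`) and the specific mappings are made up for this example:

```javascript
// Hypothetical index settings: rewrite a few known emoji to words
// before the standard tokenizer runs, so they become normal tokens.
const settings = {
  settings: {
    analysis: {
      char_filter: {
        emoji_map: {
          type: "mapping",
          mappings: [
            "😕 => confused",   // example mapping, chosen arbitrarily
            "📀 => dvd"
          ]
        }
      },
      analyzer: {
        emoji_text: {
          type: "custom",
          char_filter: ["emoji_map"],
          tokenizer: "standard",
          filter: ["lowercase"]
        }
      }
    }
  }
};

// Would be applied when creating the index, e.g.:
// client.indices.create({ index: 'articles', body: settings });
```

The obvious limitation is that only emoji you have explicitly listed become searchable; anything not in the mapping table is still dropped by the tokenizer.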

There are a lot of possibilities depending on the use case, but it's nowhere near the maturity level where we have incorporated anything into Lucene and are ready to support backwards compatibility for it, etc.

You can get some ideas from http://unicode.org/reports/tr51/#Searching and think about how you want it to work for your app.


(Ben) #5

Thanks so much for your help. Ben


(Damien Alexandre) #6

For my needs, I used a whitespace tokenizer with quite a lot of customizations. Did you come up with a good solution?
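For reference, a minimal version of the whitespace-tokenizer approach looks like the settings below. The whitespace tokenizer splits only on whitespace rather than on word-break properties, so emoji survive as (parts of) tokens. This is a sketch; the analyzer name (`emoji_whitespace`) is made up, and the "lot of customizations" mentioned above (e.g. token filters to split runs of emoji apart) are omitted:

```javascript
// Hypothetical index settings: keep emoji by tokenizing on
// whitespace only instead of Unicode word breaks.
const settings = {
  settings: {
    analysis: {
      analyzer: {
        emoji_whitespace: {
          type: "custom",
          tokenizer: "whitespace",
          filter: ["lowercase"]
        }
      }
    }
  }
};

// client.indices.create({ index: 'articles', body: settings });
```

One caveat: with this analyzer, "😕😕" or "hello😕" is a single token, so a search for a lone "😕" would not match it without further token filtering.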
