Querying smileys in tweets using Elasticsearch


(reza sadoddin) #1

Hello,
I am having trouble writing a query that searches for smileys in tweets. Can anyone show me an example of a working query? I have tried several ways of escaping a simple smiley like ":)", but none of them matches.

I should mention that I am indexing the tweet data with the Twitter river plugin.
Thank you,

~R


(Christoph) #2

You probably indexed your tweets without specifying an analyzer. In that case the default Standard Analyzer is used, which in turn uses the Standard Tokenizer. This tokenizer drops the punctuation characters that smileys are made of (":", ";", ")" etc.), so the smileys never reach the index and you can't search for them later.
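If you want to check this yourself, the _analyze API shows which tokens an analyzer emits. A minimal sketch, assuming an Elasticsearch 1.x node, where the analyzer is passed as a query parameter and the text as the plain request body (newer versions expect a JSON body instead); the sample sentence is just an illustration:

```
GET /_analyze?analyzer=standard
Don't worry, be :) now!

GET /_analyze?analyzer=whitespace
Don't worry, be :) now!
```

The standard analyzer should return only the word tokens (don't, worry, be, now), while the whitespace analyzer should keep ":)" as a token of its own.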

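Try adding a mapping with an analyzer that preserves punctuation, e.g. the Whitespace Analyzer. Here is a short example in Sense notation (careful: DELETE /_all removes every index on the cluster, so only run it on a throwaway test node):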

DELETE /_all

PUT /test

PUT /test/_mapping/my_type
{
    "my_type": {
        "properties": {
            "text": {
                "type":     "string",
                "analyzer": "whitespace"
            }
        }
    }
}

POST /test/my_type
{
    "text": "Don't worry, be :) now!"
}

GET /test/my_type/_search
{
  "query": {
    "match": {
      "text": ":)"
    }
  }
}

This should give you:

"hits": [
         {
            "_index": "test",
            "_type": "my_type",
            "_id": "AU5JRT98rcrDj8N_41cb",
            "_score": 0.13424811,
            "_source": {
               "text": "Don't worry, be :) now!"
            }
         }
      ]
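Since the question was about several smileys: a match query runs the query string through the field's analyzer and, by default, combines the resulting terms with OR. A sketch, assuming the test index and whitespace mapping from above; the choice of smileys is only an illustration:

```
GET /test/my_type/_search
{
  "query": {
    "match": {
      "text": ":) :( ;)"
    }
  }
}
```

This should match any document that contains at least one of the three as a whitespace-separated token. Note that a smiley glued to a word, as in "great:)", is indexed as the single token "great:)" and will not match.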

You can find more information about controlling analysis here: https://www.elastic.co/guide/en/elasticsearch/guide/current/_controlling_analysis.html

