How to index special characters and search for them in Elasticsearch

Hi,

You're looking in the wrong spot: your problem is related to a process called analysis. I suggest you read more about analysis in the Definitive Guide.

By default, Elasticsearch uses the "standard" analyzer to analyze text. You can try this yourself in Sense:

GET /_analyze?analyzer=standard
{
    "text": "C# developer"
}

This produces:

{
   "tokens": [
      {
         "token": "c",
         "start_offset": 6,
         "end_offset": 7,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "developer",
         "start_offset": 9,
         "end_offset": 18,
         "type": "<ALPHANUM>",
         "position": 2
      }
   ]
}

You can see that Elasticsearch's standard analyzer simply strips the "#" character (and does the same with "++"). The analyzer is applied at index time, so your text never makes it into the index in the form you want.
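
A quick check with "C++ developer" (not from the original run, but the standard analyzer behaves the same way) shows the same effect:

GET /_analyze?analyzer=standard
{
    "text": "C++ developer"
}

The resulting tokens are again just "c" and "developer"; the "++" is stripped.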

Hence, one solution to this problem is to define your own analyzer. Here is a minimal example that should get you going:

First, we create a custom analyzer. We use the whitespace tokenizer here, but you should check the documentation on custom analyzers and decide whether it really fits your use case; an alternative based on a character filter is sketched after this snippet.

PUT /my_index
{
   "settings": {
      "analysis": {
         "analyzer": {
            "my_analyzer": {
               "type": "custom",
               "filter": [
                  "lowercase"
               ],
               "tokenizer": "whitespace"
            }
         }
      }
   }
}
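
As an aside: if the whitespace tokenizer splits too coarsely for your data, an alternative is to rewrite the special characters before tokenization with a "mapping" character filter and keep the standard tokenizer. The sketch below uses made-up names ("my_index_alt", "code_mapping", "my_alt_analyzer") purely for illustration:

PUT /my_index_alt
{
   "settings": {
      "analysis": {
         "char_filter": {
            "code_mapping": {
               "type": "mapping",
               "mappings": [
                  "#=>sharp",
                  "++=>plusplus"
               ]
            }
         },
         "analyzer": {
            "my_alt_analyzer": {
               "type": "custom",
               "char_filter": [
                  "code_mapping"
               ],
               "tokenizer": "standard",
               "filter": [
                  "lowercase"
               ]
            }
         }
      }
   }
}

With this analyzer, "C#" is indexed as the token "csharp" and "C++" as "cplusplus". Since a match query applies the same analyzer at search time, a query for "C#" is rewritten to "csharp" as well, so index and query still line up.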

We can already check that the special characters are now preserved:

GET /my_index/_analyze?analyzer=my_analyzer
{
    "text": "C# developer"
}

This produces:

{
   "tokens": [
      {
         "token": "c#",
         "start_offset": 0,
         "end_offset": 2,
         "type": "word",
         "position": 0
      },
      {
         "token": "developer",
         "start_offset": 3,
         "end_offset": 12,
         "type": "word",
         "position": 1
      }
   ]
}

Note that "c#" is still present as a token. This is key to understand the rest.

Now we have to use our custom analyzer. For that, we define a new type called "jobs" whose "content" field is analyzed with it:

PUT /my_index/_mapping/jobs
{
   "properties": {
      "content": {
         "type": "string",
         "analyzer": "my_analyzer"
      }
   }
}
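
To verify that the mapping wires the analyzer up correctly, you can analyze a string through the field itself; this should yield the same "c#" and "developer" tokens as above:

GET /my_index/_analyze?field=content
{
    "text": "C# developer"
}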

We can now index some documents:

POST /_bulk
{"index":{"_index":"my_index","_type":"jobs"}}
{"content":"We are looking for C++ and C# developers"}
{"index":{"_index":"my_index","_type":"jobs"}}
{"content":"We are looking for C developers"}
{"index":{"_index":"my_index","_type":"jobs"}}
{"content":"We are looking for project managers"}

And if we now search for "C#":

GET /my_index/jobs/_search
{
   "query": {
      "match": {
         "content": {
            "query": "C#"
         }
      }
   }
}

we get the expected result:

{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.095891505,
      "hits": [
         {
            "_index": "my_index",
            "_type": "jobs",
            "_id": "AVMyfdxBfIbbKEiejUJ3",
            "_score": 0.095891505,
            "_source": {
               "content": "We are looking for C++ and C# developers"
            }
         }
      ]
   }
}
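
To convince yourself that the tokens really are distinct, try the opposite check (I am only describing the expected outcome here): a search for plain "C" should return only the "We are looking for C developers" document, because the first document contains the tokens "c++" and "c#" but not "c":

GET /my_index/jobs/_search
{
   "query": {
      "match": {
         "content": {
            "query": "C"
         }
      }
   }
}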

I can heartily recommend the Definitive Guide to get a deeper understanding of Elasticsearch.

Daniel
