How to index special characters and search for them in Elasticsearch

Hi,

You're looking in the wrong spot: your problem is related to a process called analysis. I suggest you read more about analysis in the Definitive Guide.

By default, Elasticsearch uses the "standard" analyzer to analyze text. You can try this yourself in Sense:

GET /_analyze?analyzer=standard
{
    "text": "C# developer"
}

This produces:

{
   "tokens": [
      {
         "token": "c",
         "start_offset": 6,
         "end_offset": 7,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "developer",
         "start_offset": 9,
         "end_offset": 18,
         "type": "<ALPHANUM>",
         "position": 2
      }
   ]
}

You can see that Elasticsearch's standard analyzer simply strips the "#" character (and does the same with "++"). The analyzer is applied at index time, so your text never makes it into the index in the form you want.
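
A quick check with "C++ developer" (not from the original run, but the standard analyzer behaves the same way) shows the same effect:

GET /_analyze?analyzer=standard
{
    "text": "C++ developer"
}

The resulting tokens are again just "c" and "developer"; the "++" is stripped.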

Hence, one solution to this problem is to define your own analyzer. Here is a minimal example that should get you going:

First, we create a custom analyzer. We use the whitespace tokenizer here, but you should check the documentation on custom analyzers and decide whether it really fits your use case; an alternative based on a character filter is sketched after this snippet.

PUT /my_index
{
   "settings": {
      "analysis": {
         "analyzer": {
            "my_analyzer": {
               "type": "custom",
               "filter": [
                  "lowercase"
               ],
               "tokenizer": "whitespace"
            }
         }
      }
   }
}
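
As an aside: if the whitespace tokenizer splits too coarsely for your data, an alternative is to rewrite the special characters before tokenization with a "mapping" character filter and keep the standard tokenizer. The sketch below uses made-up names ("my_index_alt", "code_mapping", "my_alt_analyzer") purely for illustration:

PUT /my_index_alt
{
   "settings": {
      "analysis": {
         "char_filter": {
            "code_mapping": {
               "type": "mapping",
               "mappings": [
                  "#=>sharp",
                  "++=>plusplus"
               ]
            }
         },
         "analyzer": {
            "my_alt_analyzer": {
               "type": "custom",
               "char_filter": [
                  "code_mapping"
               ],
               "tokenizer": "standard",
               "filter": [
                  "lowercase"
               ]
            }
         }
      }
   }
}

With this analyzer, "C#" is indexed as the token "csharp" and "C++" as "cplusplus". Since a match query applies the same analyzer at search time, a query for "C#" is rewritten to "csharp" as well, so index and query still line up.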

We can already check that the special characters are now preserved:

GET /my_index/_analyze?analyzer=my_analyzer
{
    "text": "C# developer"
}

This produces:

{
   "tokens": [
      {
         "token": "c#",
         "start_offset": 0,
         "end_offset": 2,
         "type": "word",
         "position": 0
      },
      {
         "token": "developer",
         "start_offset": 3,
         "end_offset": 12,
         "type": "word",
         "position": 1
      }
   ]
}

Note that "c#" is still present as a token. This is key to understand the rest.

Now we have to use our custom analyzer. For that, we define a new type called "jobs" whose "content" field is analyzed with it:

PUT /my_index/_mapping/jobs
{
   "properties": {
      "content": {
         "type": "string",
         "analyzer": "my_analyzer"
      }
   }
}
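
To verify that the mapping wires the analyzer up correctly, you can analyze a string through the field itself; this should yield the same "c#" and "developer" tokens as above:

GET /my_index/_analyze?field=content
{
    "text": "C# developer"
}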

We can now index some documents:

POST /_bulk
{"index":{"_index":"my_index","_type":"jobs"}}
{"content":"We are looking for C++ and C# developers"}
{"index":{"_index":"my_index","_type":"jobs"}}
{"content":"We are looking for C developers"}
{"index":{"_index":"my_index","_type":"jobs"}}
{"content":"We are looking for project managers"}

And if we now search for "C#":

GET /my_index/jobs/_search
{
   "query": {
      "match": {
         "content": {
            "query": "C#"
         }
      }
   }
}

we get the expected result:

{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.095891505,
      "hits": [
         {
            "_index": "my_index",
            "_type": "jobs",
            "_id": "AVMyfdxBfIbbKEiejUJ3",
            "_score": 0.095891505,
            "_source": {
               "content": "We are looking for C++ and C# developers"
            }
         }
      ]
   }
}
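
To convince yourself that the tokens really are distinct, try the opposite check (I am only describing the expected outcome here): a search for plain "C" should return only the "We are looking for C developers" document, because the first document contains the tokens "c++" and "c#" but not "c":

GET /my_index/jobs/_search
{
   "query": {
      "match": {
         "content": {
            "query": "C"
         }
      }
   }
}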

I can heartily recommend the Definitive Guide to get a deeper understanding of Elasticsearch.

Daniel
