How to index and search special characters in Elasticsearch


(Aravinthan Asokan) #1

Hi

I have been trying to fix this issue for more than 20 days, but couldn't get it working.
I am also new to Elasticsearch, as this is our first project with it.

Step 1:
I have installed Elasticsearch 2.0 on Ubuntu 14.04. I am able to create a new index using the code below:

require 'vendor/autoload.php'; // Composer autoloader for elasticsearch-php

$hosts = array('our ip address:9200');
$client = \Elasticsearch\ClientBuilder::create()->setHosts($hosts)->build();

$index = "IndexName";
$params['index'] = $index;
$params['type'] = 'xyz';
$params['body']['id'] = "1";
$params['body']['title'] = "C++ Developer - C# Developer";
$client->index($params);

Once the above code runs, the index is created successfully.

Step 2:
I am able to look into the created index using the link below:

http://our ip address:9200/IndexName/_search?q=C%23&pretty

{
  "took" : 30,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 9788,
    "max_score" : 0.8968174,
    "hits" : [ {
      "_index" : "IndexName",
      "_type" : "xyz",
      "_id" : "1545680",
      "_score" : 0.8968174,
      "_source" : {"id":"1545680","title":"C\+\+ and C\# \- Software Engineer"}
    }, {
      "_index" : "IndexName",
      "_type" : "xyz",
      "_id" : "1539778",
      "_score" : 0.853807,
      "_source" : {"id":"1539778","title":"Rebaca Technologies Hiring in C\+\+"}
    },
    ....


If you look at the search result above, the second hit does not contain "c#" at all. I even get the same results when searching for just "C".

I am not getting relevant search results for keywords that contain special characters such as +, #, or . (dot).

I am preserving the special characters as per the guide below:

Escaping Special Characters

Lucene supports escaping special characters that are part of the query syntax. The current list of special characters is:

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \

To escape these characters, use the \ before the character. For example, to search for (1+1):2 use the query:

\(1\+1\)\:2

I added # to the group of escaped characters.

Step 3:

In PHP, while passing the special characters into the Elasticsearch search function, I am escaping them like below:

$keyword = str_replace('"', '\\"', $keyword);
$keyword = str_replace('+', '\\+', $keyword);
$keyword = str_replace('.', '\\.', $keyword);
$keyword = str_replace('#', '\\#', $keyword);
$keyword = str_replace('/', '\\/', $keyword);
$keyword = trim($keyword);

$params['body']['query']['query_string'] = array(
    "query"            => $keyword,
    "default_operator" => "AND",
    "fields"           => array("title")
);
$client->search($params);
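
For reference, the per-character replacements can be collected into one helper covering the full Lucene list quoted above. This is a minimal sketch (escapeLuceneQuery is a hypothetical name); note that escaping only protects the query-string parser, and as the answer below explains, the analyzer can still strip these characters at index time.

// A sketch: escape every Lucene query_string special character,
// plus '#' as in the post above.
function escapeLuceneQuery($keyword) {
    // Escape the backslash first so the backslashes added below
    // are not themselves escaped again.
    $chars = array('\\', '+', '-', '&', '|', '!', '(', ')', '{', '}',
                   '[', ']', '^', '"', '~', '*', '?', ':', '/', '#');
    foreach ($chars as $char) {
        $keyword = str_replace($char, '\\' . $char, $keyword);
    }
    return trim($keyword);
}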

Please help me figure out how to make these special characters work in search.

Thanks


(Daniel Mitterdorfer) #2

Hi,

you're looking in the wrong spot. Your problem is related to something called analysis; I suggest you read more about it in the Definitive Guide.

By default, Elasticsearch uses the "standard" analyzer to analyze text. You can try this by yourself in Sense:

GET /_analyze?analyzer=standard
{
    "text": "C# developer"
}

This produces:

{
   "tokens": [
      {
         "token": "c",
         "start_offset": 0,
         "end_offset": 1,
         "type": "<ALPHANUM>",
         "position": 0
      },
      {
         "token": "developer",
         "start_offset": 3,
         "end_offset": 12,
         "type": "<ALPHANUM>",
         "position": 1
      }
   ]
}

You can see that Elasticsearch's standard analyzer simply strips the "#" character (and likewise the "++"). The analyzer is applied at index time, so your text never makes it into the index in the form you want.
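
For readers following along with the PHP client from the question, the same experiment can be run through elasticsearch-php. This is a sketch assuming the 2.x client, where analyze() accepts analyzer and text as request parameters:

require 'vendor/autoload.php';

$client = \Elasticsearch\ClientBuilder::create()->build();

// Run the standard analyzer over the sample text.
$response = $client->indices()->analyze(array(
    'analyzer' => 'standard',
    'text'     => 'C# developer'
));

// Prints only "c" and "developer"; the "#" has been stripped.
foreach ($response['tokens'] as $token) {
    echo $token['token'], "\n";
}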

Hence, one solution to this problem is to define your own analyzer. Here is a minimal example that should get you going:

First, we create a custom analyzer. We use the whitespace tokenizer here, but you should check the documentation on custom analyzers and decide whether this really fits your use case.

PUT /my_index
{
   "settings": {
      "analysis": {
         "analyzer": {
            "my_analyzer": {
               "type": "custom",
               "filter": [
                  "lowercase"
               ],
               "tokenizer": "whitespace"
            }
         }
      }
   }
}

We can already verify that the special characters are now preserved:

GET /my_index/_analyze?analyzer=my_analyzer
{
    "text": "C# developer"
}

This produces:

{
   "tokens": [
      {
         "token": "c#",
         "start_offset": 0,
         "end_offset": 2,
         "type": "word",
         "position": 0
      },
      {
         "token": "developer",
         "start_offset": 3,
         "end_offset": 12,
         "type": "word",
         "position": 1
      }
   ]
}

Note that "c#" is still present as a token. This is key to understanding the rest.

Now we have to use our custom analyzer. For that we define a new type called "jobs":

PUT /my_index/_mapping/jobs
{
   "properties": {
      "content": {
         "type": "string",
         "analyzer": "my_analyzer"
      }
   }
}
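
The same settings and mapping can also be applied in a single index-creation request from PHP. A sketch, assuming the elasticsearch-php client from the original question:

// Create the index with the custom analyzer and the "jobs" mapping
// in one request, mirroring the two Sense calls above.
$client->indices()->create(array(
    'index' => 'my_index',
    'body'  => array(
        'settings' => array(
            'analysis' => array(
                'analyzer' => array(
                    'my_analyzer' => array(
                        'type'      => 'custom',
                        'filter'    => array('lowercase'),
                        'tokenizer' => 'whitespace'
                    )
                )
            )
        ),
        'mappings' => array(
            'jobs' => array(
                'properties' => array(
                    'content' => array(
                        'type'     => 'string',
                        'analyzer' => 'my_analyzer'
                    )
                )
            )
        )
    )
));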

We can now index some documents:

POST /_bulk
{"index":{"_index":"my_index","_type":"jobs"}}
{"content":"We are looking for C++ and C# developers"}
{"index":{"_index":"my_index","_type":"jobs"}}
{"content":"We are looking for C developers"}
{"index":{"_index":"my_index","_type":"jobs"}}
{"content":"We are looking for project managers"}

And if we search now for "C#":

GET /my_index/jobs/_search
{
   "query": {
      "match": {
         "content": {
            "query": "C#"
         }
      }
   }
}

we get the expected result:

{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.095891505,
      "hits": [
         {
            "_index": "my_index",
            "_type": "jobs",
            "_id": "AVMyfdxBfIbbKEiejUJ3",
            "_score": 0.095891505,
            "_source": {
               "content": "We are looking for C++ and C# developers"
            }
         }
      ]
   }
}
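
Translated back to the PHP client from the original question, the same match query might look like this (a sketch):

$results = $client->search(array(
    'index' => 'my_index',
    'type'  => 'jobs',
    'body'  => array(
        'query' => array(
            'match' => array(
                'content' => 'C#'
            )
        )
    )
));

// Only the C++/C# document matches: the match query analyzes "C#"
// with the field's analyzer, producing the indexed token "c#".
echo $results['hits']['total'], " hit(s)\n";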

I can heartily recommend the Definitive Guide to get a deeper understanding of Elasticsearch.

Daniel


(Rudolf “Reddy” Macejka) #3

Hi Daniel,

I like your solution.
However, how can I apply the customized analyzer and the "jobs" mapping to new indexes that are created dynamically every day?

I am using Logstash for that.

Thanks, Reddy


(Christoph) #4

Hi @Rudolf_Reddy_Macejka,

you can use Index Templates to do just that.


(Rudolf “Reddy” Macejka) #6

Thank you @cbuescher,

I created the following template:

PUT _template/all_whitespace_only
{
  "template": "*",
  "version": 1,
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "anal_whitespace_only": {
          "type": "custom",
          "filter": [
            "lowercase"
          ],
          "tokenizer": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "jobs": {
      "properties": {
        "content": {
          "type": "string",
          "analyzer": "anal_whitespace_only"
        }
      }
    }
  }
}

Will provide the result then,

Rudo


(Rudolf “Reddy” Macejka) #7

Hello,

Finally I can continue. The above template applies only to the "content" field.

But how can I create it for all string fields in general?

I have found a solution; after amending it, it looks like the following:

PUT _template/all_whitespace_only
{
  "template": "tracking*",
  "version": 1,
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "default": {
          "type": "custom",
          "filter": [
            "lowercase"
          ],
          "tokenizer": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "_default_": {
      "dynamic_templates": {
        "string_fields": {
          "match": "*",
          "match_mapping_type": "string",
          "mapping": {
            "type": "string",
            "index": "analyzed",
            "analyzer": "anal_whitespace_only",
            "fielddata": {
              "format": "disabled"
            }
          }
        }
      }
    }
  }
}

However, I am getting this error:

Failed to parse mapping [_default_]: java.util.LinkedHashMap cannot be cast to java.util.List", "caused_by"=>{"type"=>"class_cast_exception", "reason"=>"java.util.LinkedHashMap cannot be cast to java.util.List"}}}}, :level=>:warn}

To be honest, I do not understand what is wrong.

Can you please help?

Thank you!
Reddy


(Christoph) #8

Which client are you using? The error looks like you're not putting the template using curl or the Sense plugin, but I might be mistaken. I suspect the error happens in the client you use.


(Rudolf “Reddy” Macejka) #9

It is an error from Logstash. Here is my configuration:

input {
  jdbc {
    jdbc_driver_library    => "e:\Utility\elasticsearch-2.1.0\addons\logstash-2.1.0\lib\ojdbc6.jar"
    jdbc_driver_class      => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "jdbc:oracle:thin:@*******:1521/****"
    jdbc_user              => "*****"
    jdbc_password          => "*******"
    parameters             => { }
    schedule               => "08 * * * *"
    statement              => "select eg.identifier id, al.DATA_CORRELATIONID IdSourceMessage, eg.SNRF IdMasterMessage, 'EMEA DCG MB UAT' sourceSystem, 'EMEA' as region, 'DCG MB UAT' as platform, to_char(eg.datetime, 'YYYY-MM-DD\"T\"HH24:MI') timestamp, eg.sender Sender, eg.receiver Receiver, eg.aprf APRF, eg.snrf SNRF, eg.doctrackid trackingID, decode(substr(lower(eg.Status),1,11),'translation',trim(REGEXP_REPLACE(eg.Status,'[[:alpha:]]'))) SessionID, eg.status Status, eg.text Action, eg.details Details from gxsmailbox_uat.tgeg_log eg left join dcgplatform_uat.gl_utils_audit_log al on eg.doctrackid = substr(al.data_message,instr(lower(al.data_message),'indentifier')+13,18) where eg.class = 'DATA' and eg.datetime > sysdate-1.1/24 order by eg.identifier"
  }
}

filter {
  mutate {
    gsub => [
      "timestamp", "[\\]", ""
    ]
  }
}

output {
  elasticsearch {
    "index"         => "tracking-%{+YYYY.MM.dd}"
    "document_type" => "eg"
    "document_id"   => "eg-uat-%{id}"
  }
}

(Christoph) #10

According to the documentation, dynamic_templates needs to be an array. Does this solve your problem?
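
For reference, here is a sketch of the corrected template expressed through the PHP client used earlier in this thread. The key change is that dynamic_templates is a list (a numerically indexed array) of single-entry objects; note also that this version points the mapping at the "default" analyzer, since the template above never defines one named "anal_whitespace_only".

$client->indices()->putTemplate(array(
    'name' => 'all_whitespace_only',
    'body' => array(
        'template' => 'tracking*',
        'version'  => 1,
        'settings' => array(
            'number_of_shards' => 1,
            'analysis' => array(
                'analyzer' => array(
                    'default' => array(
                        'type'      => 'custom',
                        'filter'    => array('lowercase'),
                        'tokenizer' => 'whitespace'
                    )
                )
            )
        ),
        'mappings' => array(
            '_default_' => array(
                // dynamic_templates must be a list of one-entry objects;
                // passing a plain object causes the LinkedHashMap error.
                'dynamic_templates' => array(
                    array(
                        'string_fields' => array(
                            'match'              => '*',
                            'match_mapping_type' => 'string',
                            'mapping' => array(
                                'type'      => 'string',
                                'index'     => 'analyzed',
                                'analyzer'  => 'default',
                                'fielddata' => array('format' => 'disabled')
                            )
                        )
                    )
                )
            )
        )
    )
));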


(Rudolf “Reddy” Macejka) #11

Will check and let you know.


(Rudolf “Reddy” Macejka) #12

Hi,

I have removed all mappings from the template, keeping only the analyzer part, and it works as I wanted ... I tried to combine too many things at once :slight_smile:

thank you.


(Mateus Pádua) #13

Tip: to escape all special characters in Python, do:

>>> import re
>>> q = "your_search+here-now&"
>>> q = re.sub(pattern=r'([+\-=&|><(){}\[\]\^"~*?:\/])', repl=r'\\\1', string=q)
>>> print q
your_search\+here\-now\&

(Shellbye Bai) #14

That's really helpful. Thanks!

