How to index and search special characters in Elasticsearch

Hi

I have been trying to fix this issue for more than 20 days, but couldn't get it working. I am also new to Elasticsearch, as this is our first project with it.

Step 1:
I have installed Elasticsearch 2.0 on Ubuntu 14.04. I am able to create a new index using the code below:

// Connect to the cluster and index a single document
$hosts = array('our ip address:9200');
$client = \Elasticsearch\ClientBuilder::create()->setHosts($hosts)->build();

$params['index'] = 'IndexName';
$params['type'] = 'xyz';
$params['body']['id'] = '1';
$params['body']['title'] = 'C++ Developer - C# Developer';
$client->index($params);
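
For reference, this should be roughly equivalent to the following Sense request (the document _id is auto-generated because no id parameter is passed to the client):

POST /IndexName/xyz
{
   "id": "1",
   "title": "C++ Developer - C# Developer"
}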

Once the above code runs, the index is created successfully.

Step 2:
I am able to look at the created index using the URL below:

http://our ip address:9200/IndexName/_search?q=C%23&pretty

{
  "took" : 30,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 9788,
    "max_score" : 0.8968174,
    "hits" : [ {
      "_index" : "IndexName",
      "_type" : "xyz",
      "_id" : "1545680",
      "_score" : 0.8968174,
      "_source" : {"id":"1545680","title":"C\+\+ and C\# \- Software Engineer"}
    }, {
      "_index" : "IndexName",
      "_type" : "xyz",
      "_id" : "1539778",
      "_score" : 0.853807,
      "_source" : {"id":"1539778","title":"Rebaca Technologies Hiring in C\+\+"}
    }
    ....

Searching for just "C" returns the same results:

http://our ip address:9200/IndexName/_search?q=C&pretty

{
  "took" : 30,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 9788,
    "max_score" : 0.8968174,
    "hits" : [ {
      "_index" : "IndexName",
      "_type" : "xyz",
      "_id" : "1545680",
      "_score" : 0.8968174,
      "_source" : {"id":"1545680","title":"C\+\+ and C\# \- Software Engineer"}
    }, {
      "_index" : "IndexName",
      "_type" : "xyz",
      "_id" : "1539778",
      "_score" : 0.853807,
      "_source" : {"id":"1539778","title":"Rebaca Technologies Hiring in C\+\+"}
    }
    ....

If you look at the search results above, the second hit does not contain "C#" at all, and I get the same results when searching for just "C".

I am not getting relevant search results for keywords that contain special characters like +, #, or .

I am preserving the special characters as per the guide below:

Escaping Special Characters

Lucene supports escaping special characters that are part of the query syntax. The current list of special characters is:

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \

To escape these characters, use the \ before the character. For example, to search for (1+1):2 use the query:

\(1\+1\)\:2

I added # to this group of escaped characters.
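
In a query_string search this escaped example would look something like the following (note that each backslash has to be doubled inside a JSON string):

GET /IndexName/_search
{
   "query": {
      "query_string": {
         "query": "\\(1\\+1\\)\\:2"
      }
   }
}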

Step 3:

In PHP, while passing the special characters into the Elasticsearch search function, I am escaping them like below:

$keyword = str_replace('"', '\\"', $keyword);
$keyword = str_replace('+', '\\+', $keyword);
$keyword = str_replace('.', '\\.', $keyword);
$keyword = str_replace('#', '\\#', $keyword);
$keyword = str_replace('/', '\\/', $keyword);
$keyword = trim($keyword);

$params['body']['query']['query_string'] = array(
    "query" => $keyword,
    "default_operator" => "AND",
    "fields" => array("title")
);
$client->search($params);
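
For reference, for a keyword like C# the PHP above should build a request roughly equivalent to:

GET /IndexName/_search
{
   "query": {
      "query_string": {
         "query": "C\\#",
         "default_operator": "AND",
         "fields": ["title"]
      }
   }
}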

Please help me make the special characters work in search.

Thanks

1 Like

Hi,

you're looking in the wrong spot. Your problem is related to something called analysis. I suggest you read more about analysis in the Definitive Guide.

By default, Elasticsearch uses the "standard" analyzer to analyze text. You can try this yourself in Sense:

GET /_analyze?analyzer=standard
{
    "text": "C# developer"
}

This produces:

{
   "tokens": [
      {
         "token": "c",
         "start_offset": 6,
         "end_offset": 7,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "developer",
         "start_offset": 9,
         "end_offset": 18,
         "type": "<ALPHANUM>",
         "position": 2
      }
   ]
}

You can see that Elasticsearch's standard analyzer just strips the "#" character (and similarly "++"). The analyzer is applied at index time so your text never makes it into the index as you want it.
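
You can check the same for "C++":

GET /_analyze?analyzer=standard
{
    "text": "C++ developer"
}

This again returns only the tokens c and developer, so "C", "C++" and "C#" all end up as the same token c in the index.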

Hence, one solution to this problem is to define your own analyzer. Here is a minimal example that should get you going:

First, we create a custom analyzer. We use the whitespace tokenizer here but you should check the documentation on custom analyzers and decide whether this really fits your use case.

PUT /my_index
{
   "settings": {
      "analysis": {
         "analyzer": {
            "my_analyzer": {
               "type": "custom",
               "filter": [
                  "lowercase"
               ],
               "tokenizer": "whitespace"
            }
         }
      }
   }
}

We can already verify that the special characters are now preserved:

GET /my_index/_analyze?analyzer=my_analyzer
{
    "text": "C# developer"
}

This produces:

{
   "tokens": [
      {
         "token": "c#",
         "start_offset": 0,
         "end_offset": 2,
         "type": "word",
         "position": 0
      },
      {
         "token": "developer",
         "start_offset": 3,
         "end_offset": 12,
         "type": "word",
         "position": 1
      }
   ]
}

Note that "c#" is still present as a token. This is key to understand the rest.

Now we have to use our custom analyzer. For that we define a new type called "jobs":

PUT /my_index/_mapping/jobs
{
   "properties": {
      "content": {
         "type": "string",
         "analyzer": "my_analyzer"
      }
   }
}
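
If you want to double-check that the mapping has been applied, you can retrieve it:

GET /my_index/_mapping/jobs

This should show the content field mapped with my_analyzer.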

We can now index some documents:

POST /_bulk
{"index":{"_index":"my_index","_type":"jobs"}}
{"content":"We are looking for C++ and C# developers"}
{"index":{"_index":"my_index","_type":"jobs"}}
{"content":"We are looking for C developers"}
{"index":{"_index":"my_index","_type":"jobs"}}
{"content":"We are looking for project managers"}

And if we search now for "C#":

GET /my_index/jobs/_search
{
   "query": {
      "match": {
         "content": {
            "query": "C#"
         }
      }
   }
}

we get the expected result:

{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.095891505,
      "hits": [
         {
            "_index": "my_index",
            "_type": "jobs",
            "_id": "AVMyfdxBfIbbKEiejUJ3",
            "_score": 0.095891505,
            "_source": {
               "content": "We are looking for C++ and C# developers"
            }
         }
      ]
   }
}
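
In contrast, a search for plain "C" should now match only the second document ("We are looking for C developers"), because c, c++ and c# are distinct tokens in the index:

GET /my_index/jobs/_search
{
   "query": {
      "match": {
         "content": {
            "query": "C"
         }
      }
   }
}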

I can heartily recommend the Definitive Guide to get a deeper understanding of Elasticsearch.

Daniel

13 Likes

Hi Daniel,

I like your solution. However, how can I apply the custom analyzer and the jobs mapping to new indexes that are dynamically created every day?

I am using Logstash for that.

Thanks, Reddy

1 Like

Hi @Rudolf_Reddy_Macejka,

you can use Index Templates for doing just that.

2 Likes

Thank you @cbuescher,

I created the following template:

PUT _template/all_whitespace_only
{
  "template": "*",
  "version": 1,
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "anal_whitespace_only": {
          "type": "custom",
          "filter": [
            "lowercase"
          ],
          "tokenizer": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "jobs": {
      "properties": {
        "content": {
          "type": "string",
          "analyzer": "anal_whitespace_only"
        }
      }
    }
  }
}

I will report back with the result,

Rudo

1 Like

Hello,

finally I can continue. The above template applies only to the content field.

But how can I create it for all string fields in general?

I found a solution; after amending it, it looks like this:

PUT _template/all_whitespace_only
{
  "template": "tracking*",
  "version": 1,
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "default": {
          "type": "custom",
          "filter": [
            "lowercase"
          ],
          "tokenizer": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "_default_": {
      "dynamic_templates": {
        "string_fields": {
          "match": "*",
          "match_mapping_type": "string",
          "mapping": {
            "type": "string",
            "index": "analyzed",
            "analyzer": "anal_whitespace_only",
            "fielddata": {
              "format": "disabled"
            }
          }
        }
      }
    }
  }
}

However, I am getting this error:

Failed to parse mapping [_default_]: java.util.LinkedHashMap cannot be cast to java.util.List", "caused_by"=>{"type"=>"class_cast_exception", "reason"=>"java.util.LinkedHashMap cannot be cast to java.util.List"}}}}, :level=>:warn}

To be honest, I do not understand what is wrong.

Can you please help?

Thank you!
Reddy

1 Like

Which client are you using? The error looks like you're not putting the template using curl or the Sense plugin, but I might be mistaken. I suspect there's an error in how the client is used.

1 Like

It is an error from Logstash. Here is my configuration:

input {
  jdbc {
    jdbc_driver_library => "e:\Utility\elasticsearch-2.1.0\addons\logstash-2.1.0\lib\ojdbc6.jar"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "jdbc:oracle:thin:@*******:1521/****"
    jdbc_user => "*****"
    jdbc_password => "*******"
    parameters => { }
    schedule => "08 * * * *"
    statement => "select eg.identifier id, al.DATA_CORRELATIONID IdSourceMessage, eg.SNRF IdMasterMessage, 'EMEA DCG MB UAT' sourceSystem, 'EMEA' as region, 'DCG MB UAT' as platform, to_char(eg.datetime, 'YYYY-MM-DD\"T\"HH24:MI') timestamp, eg.sender Sender, eg.receiver Receiver, eg.aprf APRF, eg.snrf SNRF, eg.doctrackid trackingID, decode(substr(lower(eg.Status),1,11),'translation',trim(REGEXP_REPLACE(eg.Status,'[[:alpha:]]'))) SessionID, eg.status Status, eg.text Action, eg.details Details from gxsmailbox_uat.tgeg_log eg left join dcgplatform_uat.gl_utils_audit_log al on eg.doctrackid = substr(al.data_message,instr(lower(al.data_message),'indentifier')+13,18) where eg.class = 'DATA' and eg.datetime > sysdate-1.1/24 order by eg.identifier"
  }
}

filter {
  mutate {
    gsub => [
      "timestamp", "[\\]", ""
    ]
  }
}

output {
  elasticsearch {
    "index" => "tracking-%{+YYYY.MM.dd}"
    "document_type" => "eg"
    "document_id" => "eg-uat-%{id}"
  }
}

1 Like

According to the documentation, dynamic_templates needs to be an array. Does this solve your problem?
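
With that change, the mappings section of your template would look something like this (everything else unchanged):

"mappings": {
   "_default_": {
      "dynamic_templates": [
         {
            "string_fields": {
               "match": "*",
               "match_mapping_type": "string",
               "mapping": {
                  "type": "string",
                  "index": "analyzed",
                  "analyzer": "anal_whitespace_only",
                  "fielddata": {
                     "format": "disabled"
                  }
               }
            }
         }
      ]
   }
}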

1 Like

Will check and let you know.

1 Like

Hi,

I have removed all mappings from the template, keeping only the analyzer part, and it works as I wanted ... I tried to combine too many things :slight_smile:

Thank you.

1 Like

Tip: to escape all special characters in Python, do:

>>> import re
>>> q = "your_search+here-now&"
>>> q = re.sub(pattern=r'([+\-=&|><(){}\[\]\^"~*?:\/])', repl=r'\\\1', string=q)
>>> print q
your_search\+here\-now\&
1 Like

That's really helpful. Thanks!

1 Like