How to index and search special characters in Elasticsearch


(Aravinthan Asokan) #1

Hi

I have been trying to fix this issue for more than 20 days, but couldn't get it working.
I am also new to Elasticsearch, as this is our first project with it.

Step 1:
I have installed Elasticsearch 2.0 on Ubuntu 14.04. I am able to create a new index using the code below:

require 'vendor/autoload.php'; // Composer autoloader for elasticsearch-php

$hosts = array('our ip address:9200');
$client = \Elasticsearch\ClientBuilder::create()->setHosts($hosts)->build();

$index = "IndexName";
$params['index'] = $index;
$params['type'] = 'xyz';
$params['body']['id'] = "1";
$params['body']['title'] = "C++ Developer - C# Developer";
$client->index($params);

Once the above code runs, the index is created successfully.

Step 2:
I am able to look into the created index using the link below:

http://our ip address:9200/IndexName/_search?q=C%23&pretty

{
  "took" : 30,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 9788,
    "max_score" : 0.8968174,
    "hits" : [ {
      "_index" : "IndexName",
      "_type" : "xyz",
      "_id" : "1545680",
      "_score" : 0.8968174,
      "_source" : {"id":"1545680","title":"C\+\+ and C\# \- Software Engineer"}
    }, {
      "_index" : "IndexName",
      "_type" : "xyz",
      "_id" : "1539778",
      "_score" : 0.853807,
      "_source" : {"id":"1539778","title":"Rebaca Technologies Hiring in C\+\+"}
    },
    ....


If you look at the search result above, the second hit does not contain "c#" at all. I even get the same results when searching for just "C".

I am not getting relevant search results for keywords that contain special characters such as +, #, or . (dot).

I am preserving the special characters as per the guide below:

Escaping Special Characters

Lucene supports escaping special characters that are part of the query syntax. The current list of special characters is:

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \

To escape these characters, use the \ before the character. For example, to search for (1+1):2 use the query:

\(1\+1\)\:2

I added # to the group of escaped characters.

Step 3:

In PHP, while passing the special characters into the Elasticsearch search function, I am escaping them like below:

$keyword = str_replace('"', '\\"', $keyword);
$keyword = str_replace('+', '\\+', $keyword);
$keyword = str_replace('.', '\\.', $keyword);
$keyword = str_replace('#', '\\#', $keyword);
$keyword = str_replace('/', '\\/', $keyword);
$keyword = trim($keyword);

$params['body']['query']['query_string'] = array(
    "query"            => $keyword,
    "default_operator" => "AND",
    "fields"           => array("title")
);
$client->search($params);
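
For reference, the per-character replacements can be collected into one helper covering the full Lucene list quoted above. This is a minimal sketch (escapeLuceneQuery is a hypothetical name); note that escaping only protects the query-string parser, and as the answer below explains, the analyzer can still strip these characters at index time.

// A sketch: escape every Lucene query_string special character,
// plus '#' as in the post above.
function escapeLuceneQuery($keyword) {
    // Escape the backslash first so the backslashes added below
    // are not themselves escaped again.
    $chars = array('\\', '+', '-', '&', '|', '!', '(', ')', '{', '}',
                   '[', ']', '^', '"', '~', '*', '?', ':', '/', '#');
    foreach ($chars as $char) {
        $keyword = str_replace($char, '\\' . $char, $keyword);
    }
    return trim($keyword);
}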

Please help me figure out how to make these special characters work in search.

Thanks


(Daniel Mitterdorfer) #2

Hi,

you're looking in the wrong spot. Your problem is related to something called analysis; I suggest you read more about it in the Definitive Guide.

By default, Elasticsearch uses the "standard" analyzer to analyze text. You can try this by yourself in Sense:

GET /_analyze?analyzer=standard
{
    "text": "C# developer"
}

This produces:

{
   "tokens": [
      {
         "token": "c",
         "start_offset": 0,
         "end_offset": 1,
         "type": "<ALPHANUM>",
         "position": 0
      },
      {
         "token": "developer",
         "start_offset": 3,
         "end_offset": 12,
         "type": "<ALPHANUM>",
         "position": 1
      }
   ]
}

You can see that Elasticsearch's standard analyzer simply strips the "#" character (and likewise the "++"). The analyzer is applied at index time, so your text never makes it into the index in the form you want.
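
For readers following along with the PHP client from the question, the same experiment can be run through elasticsearch-php. This is a sketch assuming the 2.x client, where analyze() accepts analyzer and text as request parameters:

require 'vendor/autoload.php';

$client = \Elasticsearch\ClientBuilder::create()->build();

// Run the standard analyzer over the sample text.
$response = $client->indices()->analyze(array(
    'analyzer' => 'standard',
    'text'     => 'C# developer'
));

// Prints only "c" and "developer"; the "#" has been stripped.
foreach ($response['tokens'] as $token) {
    echo $token['token'], "\n";
}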

Hence, one solution to this problem is to define your own analyzer. Here is a minimal example that should get you going:

First, we create a custom analyzer. We use the whitespace tokenizer here, but you should check the documentation on custom analyzers and decide whether this really fits your use case.

PUT /my_index
{
   "settings": {
      "analysis": {
         "analyzer": {
            "my_analyzer": {
               "type": "custom",
               "filter": [
                  "lowercase"
               ],
               "tokenizer": "whitespace"
            }
         }
      }
   }
}

We can already verify that the special characters are now preserved:

GET /my_index/_analyze?analyzer=my_analyzer
{
    "text": "C# developer"
}

This produces:

{
   "tokens": [
      {
         "token": "c#",
         "start_offset": 0,
         "end_offset": 2,
         "type": "word",
         "position": 0
      },
      {
         "token": "developer",
         "start_offset": 3,
         "end_offset": 12,
         "type": "word",
         "position": 1
      }
   ]
}

Note that "c#" is still present as a token. This is key to understanding the rest.

Now we have to use our custom analyzer. For that we define a new type called "jobs":

PUT /my_index/_mapping/jobs
{
   "properties": {
      "content": {
         "type": "string",
         "analyzer": "my_analyzer"
      }
   }
}
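
The same settings and mapping can also be applied in a single index-creation request from PHP. A sketch, assuming the elasticsearch-php client from the original question:

// Create the index with the custom analyzer and the "jobs" mapping
// in one request, mirroring the two Sense calls above.
$client->indices()->create(array(
    'index' => 'my_index',
    'body'  => array(
        'settings' => array(
            'analysis' => array(
                'analyzer' => array(
                    'my_analyzer' => array(
                        'type'      => 'custom',
                        'filter'    => array('lowercase'),
                        'tokenizer' => 'whitespace'
                    )
                )
            )
        ),
        'mappings' => array(
            'jobs' => array(
                'properties' => array(
                    'content' => array(
                        'type'     => 'string',
                        'analyzer' => 'my_analyzer'
                    )
                )
            )
        )
    )
));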

We can now index some documents:

POST /_bulk
{"index":{"_index":"my_index","_type":"jobs"}}
{"content":"We are looking for C++ and C# developers"}
{"index":{"_index":"my_index","_type":"jobs"}}
{"content":"We are looking for C developers"}
{"index":{"_index":"my_index","_type":"jobs"}}
{"content":"We are looking for project managers"}

And if we search now for "C#":

GET /my_index/jobs/_search
{
   "query": {
      "match": {
         "content": {
            "query": "C#"
         }
      }
   }
}

we get the expected result:

{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.095891505,
      "hits": [
         {
            "_index": "my_index",
            "_type": "jobs",
            "_id": "AVMyfdxBfIbbKEiejUJ3",
            "_score": 0.095891505,
            "_source": {
               "content": "We are looking for C++ and C# developers"
            }
         }
      ]
   }
}
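
Translated back to the PHP client from the original question, the same match query might look like this (a sketch):

$results = $client->search(array(
    'index' => 'my_index',
    'type'  => 'jobs',
    'body'  => array(
        'query' => array(
            'match' => array(
                'content' => 'C#'
            )
        )
    )
));

// Only the C++/C# document matches: the match query analyzes "C#"
// with the field's analyzer, producing the indexed token "c#".
echo $results['hits']['total'], " hit(s)\n";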

I can heartily recommend the Definitive Guide to get a deeper understanding of Elasticsearch.

Daniel


(Rudolf “Reddy” Macejka) #3

Hi Daniel,

I like your solution.
However, how can I apply the customized analyzer and the "jobs" mapping to new indexes that are created dynamically every day?

I am using Logstash for that.

Thanks, Reddy


(Christoph) #4

Hi @Rudolf_Reddy_Macejka,

you can use Index Templates to do just that.


(Rudolf “Reddy” Macejka) #6

Thank you @cbuescher,

I created the following template:

PUT _template/all_whitespace_only
{
  "template": "*",
  "version": 1,
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "anal_whitespace_only": {
          "type": "custom",
          "filter": [
            "lowercase"
          ],
          "tokenizer": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "jobs": {
      "properties": {
        "content": {
          "type": "string",
          "analyzer": "anal_whitespace_only"
        }
      }
    }
  }
}

Will provide the result then,

Rudo


(Rudolf “Reddy” Macejka) #7

Hello,

Finally I can continue. The above template applies only to the "content" field.

But how can I create it for all string fields in general?

I have found a solution; after amending it, it looks like the following:

PUT _template/all_whitespace_only
{
  "template": "tracking*",
  "version": 1,
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "default": {
          "type": "custom",
          "filter": [
            "lowercase"
          ],
          "tokenizer": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "_default_": {
      "dynamic_templates": {
        "string_fields": {
          "match": "*",
          "match_mapping_type": "string",
          "mapping": {
            "type": "string",
            "index": "analyzed",
            "analyzer": "anal_whitespace_only",
            "fielddata": {
              "format": "disabled"
            }
          }
        }
      }
    }
  }
}

However, I am getting this error:

Failed to parse mapping [_default_]: java.util.LinkedHashMap cannot be cast to java.util.List", "caused_by"=>{"type"=>"class_cast_exception", "reason"=>"java.util.LinkedHashMap cannot be cast to java.util.List"}}}}, :level=>:warn}

To be honest, I do not understand what is wrong.

Can you please help?

Thank you!
Reddy


(Christoph) #8

Which client are you using? The error looks like you're not putting the template using curl or the Sense plugin, but I might be mistaken. I suspect the error happens in the client you use.


(Rudolf “Reddy” Macejka) #9

It is an error from Logstash. Here is my configuration:

input {
  jdbc {
    jdbc_driver_library    => "e:\Utility\elasticsearch-2.1.0\addons\logstash-2.1.0\lib\ojdbc6.jar"
    jdbc_driver_class      => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "jdbc:oracle:thin:@*******:1521/****"
    jdbc_user              => "*****"
    jdbc_password          => "*******"
    parameters             => { }
    schedule               => "08 * * * *"
    statement              => "select eg.identifier id, al.DATA_CORRELATIONID IdSourceMessage, eg.SNRF IdMasterMessage, 'EMEA DCG MB UAT' sourceSystem, 'EMEA' as region, 'DCG MB UAT' as platform, to_char(eg.datetime, 'YYYY-MM-DD\"T\"HH24:MI') timestamp, eg.sender Sender, eg.receiver Receiver, eg.aprf APRF, eg.snrf SNRF, eg.doctrackid trackingID, decode(substr(lower(eg.Status),1,11),'translation',trim(REGEXP_REPLACE(eg.Status,'[[:alpha:]]'))) SessionID, eg.status Status, eg.text Action, eg.details Details from gxsmailbox_uat.tgeg_log eg left join dcgplatform_uat.gl_utils_audit_log al on eg.doctrackid = substr(al.data_message,instr(lower(al.data_message),'indentifier')+13,18) where eg.class = 'DATA' and eg.datetime > sysdate-1.1/24 order by eg.identifier"
  }
}

filter {
  mutate {
    gsub => [
      "timestamp", "[\\]", ""
    ]
  }
}

output {
  elasticsearch {
    "index"         => "tracking-%{+YYYY.MM.dd}"
    "document_type" => "eg"
    "document_id"   => "eg-uat-%{id}"
  }
}

(Christoph) #10

According to the documentation, dynamic_templates needs to be an array. Does this solve your problem?
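
For reference, here is a sketch of the corrected template expressed through the PHP client used earlier in this thread. The key change is that dynamic_templates is a list (a numerically indexed array) of single-entry objects; note also that this version points the mapping at the "default" analyzer, since the template above never defines one named "anal_whitespace_only".

$client->indices()->putTemplate(array(
    'name' => 'all_whitespace_only',
    'body' => array(
        'template' => 'tracking*',
        'version'  => 1,
        'settings' => array(
            'number_of_shards' => 1,
            'analysis' => array(
                'analyzer' => array(
                    'default' => array(
                        'type'      => 'custom',
                        'filter'    => array('lowercase'),
                        'tokenizer' => 'whitespace'
                    )
                )
            )
        ),
        'mappings' => array(
            '_default_' => array(
                // dynamic_templates must be a list of one-entry objects;
                // passing a plain object causes the LinkedHashMap error.
                'dynamic_templates' => array(
                    array(
                        'string_fields' => array(
                            'match'              => '*',
                            'match_mapping_type' => 'string',
                            'mapping' => array(
                                'type'      => 'string',
                                'index'     => 'analyzed',
                                'analyzer'  => 'default',
                                'fielddata' => array('format' => 'disabled')
                            )
                        )
                    )
                )
            )
        )
    )
));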


(Rudolf “Reddy” Macejka) #11

Will check and let you know.


(Rudolf “Reddy” Macejka) #12

Hi,

I have removed all mappings from the template, keeping only the analyzer part, and it works as I wanted ... I tried to combine too many things at once :slight_smile:

thank you.


(Mateus Pádua) #13

Tip: to escape all special characters in Python, do:

>>> import re
>>> q = "your_search+here-now&"
>>> q = re.sub(pattern=r'([+\-=&|><(){}\[\]\^"~*?:\/])', repl=r'\\\1', string=q)
>>> print q
your_search\+here\-now\&

(Shellbye Bai) #14

That's really helpful. Thanks!

