Indexing special characters as symbols and searching with Unicode values


(Prem Kumar) #1

I am trying to index and search for special characters in Elasticsearch.
I used the whitespace tokenizer, and I am able to index special characters and search for them fine.

But I have a situation where I need to index the special character as a symbol and search for it using its equivalent Unicode value.
Example:
I am indexing the document below (the "γ" in "γcomp" is a lowercase gamma symbol):

{
   "id": 1,
   "documentId": "334567",
   "fieldValue": [
      {
         "fieldId": 175699,
         "textValue": [
            {
               "paragraph": "@doc"
            },
            {
               "paragraph": "γcomp"
            },
            {
               "paragraph": "@Keyboard"
            }
         ],
         "integerValue": "",
         "numericValue": "",
         "modifiedDate": "2010-01-01",
         "modifiedUser": "Tr"
      }
   ]
}

Now I want to search for "γcomp" using the Unicode value of "γ", i.e. "&#947comp", but it is not working.

Can anybody please help with this?


(Zachary Tong) #2

There is no automatic conversion of HTML entities into their corresponding Unicode characters. Elasticsearch expects UTF-8 encoded strings, so encoding schemes like HTML entities and URL-encoding just look like valid UTF-8 and are indexed/searched as they are.

If you need to convert HTML entities into their corresponding unicode characters, I'd probably just run the conversion in my application.
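As a rough illustration of the application-side conversion, here is a minimal Python sketch. It assumes the application is written in Python and that decoding every HTML character reference in the query is acceptable. The function name decode_entities is hypothetical; html.unescape comes from the standard library and follows HTML5 rules, so it also accepts numeric references that are missing the trailing semicolon.

```python
import html


def decode_entities(text: str) -> str:
    """Turn HTML character references (e.g. "&#947;") into Unicode characters.

    Hypothetical helper: call it on the query string (and on document text,
    if that may contain entities too) before sending it to Elasticsearch.
    """
    return html.unescape(text)


if __name__ == "__main__":
    # The query from the question, with and without the trailing semicolon.
    print(decode_entities("&#947comp"))
    print(decode_entities("&#947;comp"))
```

Note that this decodes named references such as "&amp;" as well, so it is only appropriate if the input is really meant to be read as HTML-encoded text.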

If you have a relatively small list of characters you need to convert on a regular basis, you can use the char_filter as described here: https://www.elastic.co/guide/en/elasticsearch/guide/current/char-filters.html#_tidying_up_punctuation
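For the char_filter route, the index settings might look roughly like the sketch below, written as a Python dict that can be serialized to JSON and sent when creating the index. The names entity_to_unicode and ws_entities are made up for illustration, and the mapping list covers only the gamma entity from the question; a real list would need one entry per character you care about. It assumes the "mapping" char filter type together with the whitespace tokenizer already mentioned in this thread.

```python
import json


def build_analysis_settings() -> dict:
    """Build index settings with a mapping char filter (illustrative names)."""
    return {
        "analysis": {
            "char_filter": {
                # Hypothetical filter name; rewrites entity text before tokenizing.
                "entity_to_unicode": {
                    "type": "mapping",
                    "mappings": [
                        "&#947; => γ",  # entity with trailing semicolon
                        "&#947 => γ",   # form used in the question (no semicolon)
                    ],
                }
            },
            "analyzer": {
                # Hypothetical analyzer name; keeps the whitespace tokenizer.
                "ws_entities": {
                    "type": "custom",
                    "char_filter": ["entity_to_unicode"],
                    "tokenizer": "whitespace",
                }
            },
        }
    }


if __name__ == "__main__":
    # Print the JSON body to use as the "settings" of a create-index request.
    print(json.dumps(build_analysis_settings(), ensure_ascii=False, indent=2))
```

If the analyzer is assigned to the paragraph field, the same substitution runs at both index and search time, so a query for "&#947comp" is analyzed to the same token as the indexed "γcomp".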

