Indexing special characters as symbols and searching with Unicode values

(Prem Kumar) #1

I am trying to index and search for special characters in Elasticsearch.
I used the whitespace tokenizer, and I am able to index special characters and search for them without problems.

But I have a situation where I need to index the special character as a symbol and search for it using its equivalent Unicode value.
I am indexing the document below:

{
   "id": 1,
   "documentId": "334567",
   "fieldValue": [{
         "fieldId": 175699,
         "textValue": [{
               "paragraph": "γcomp" // this is a lowercase gamma symbol
         }],
         "integerValue": "",
         "numericValue": "",
         "modifiedDate": "2010-01-01",
         "modifiedUser": "Tr"
   }]
}

Now I want to search for "γcomp" using the Unicode value of "γ", i.e. "&#947comp", but it is not working.

Can anybody please help with this?

(Zachary Tong) #2

There is no automatic conversion of HTML entities into their corresponding Unicode characters. Elasticsearch expects UTF-8 encoded strings, so text written with encoding schemes like HTML entities or URL-encoding is itself valid UTF-8 and is indexed and searched exactly as it is.

If you need to convert HTML entities into their corresponding unicode characters, I'd probably just run the conversion in my application.
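
As a rough illustration of doing the conversion in the application, here is a minimal Python sketch that decodes HTML entities before the text is sent to Elasticsearch. The function name `normalize_text` is hypothetical, and the sketch assumes the same step is applied both to documents at index time and to query strings at search time.

```python
import html


def normalize_text(text: str) -> str:
    """Decode HTML character references into the Unicode characters they name.

    html.unescape handles decimal references (&#947;), hex references
    (&#x3b3;) and named entities (&gamma;). A decimal reference with the
    trailing semicolon missing, such as "&#947comp", is decoded as well.
    """
    return html.unescape(text)


# Hypothetical usage: clean the query string before building the search request.
query = normalize_text("&#947comp")
print(query)
```

Running the same function over the "paragraph" value before indexing means both sides end up holding the plain "γ" character.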

If you have a relatively small list of characters you need to convert on a regular basis, you can use the char_filter as described here:
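
For reference, a sketch of what such a setup could look like, written as a Python dict that would be sent as the index-creation body. It uses the `mapping` char filter together with the whitespace tokenizer mentioned in the question. The names `entity_to_unicode` and `symbol_analyzer` are made up for this example, and the list of mappings is an assumption that covers only the gamma case.

```python
import json

# Hypothetical names: "entity_to_unicode" and "symbol_analyzer" are illustrative only.
index_settings = {
    "settings": {
        "analysis": {
            "char_filter": {
                "entity_to_unicode": {
                    "type": "mapping",
                    # Each rule rewrites the text on the left into the text on the right
                    # before tokenization. The second rule covers the form without a
                    # semicolon used in the question; note it would also rewrite the
                    # start of a longer number such as "&#9470".
                    "mappings": [
                        "&#947; => γ",
                        "&#947 => γ",
                    ],
                }
            },
            "analyzer": {
                "symbol_analyzer": {
                    "type": "custom",
                    "char_filter": ["entity_to_unicode"],
                    "tokenizer": "whitespace",
                }
            },
        }
    }
}

# Serialize to the JSON body for the create-index request.
body = json.dumps(index_settings, ensure_ascii=False, indent=2)
print(body)
```

The analyzer would still need to be assigned to the relevant text field in the mapping so that it runs both when indexing and when searching.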
