Dealing with diacritic values (example: Lehtelä)

Hi all,
I have some values that contain diacritics (special marks that affect how a word is pronounced).
The problem is that these values are getting encoded to base64 during indexing. For example, the term above is indexed as TGVodGVsw6Q=, and if my query is vod, it retrieves the document containing TGVodGVsw6Q=, which is wrong.
How do I address this situation?
I want the value to be stored as it is (Lehtelä) and still be searchable.
By the way, I also tried this filter:

"filter": {
  "my_ascii_folding": {
    "type": "asciifolding",
    "preserve_original": "true"
  }
}

but it did not work.
Thanks

Hey,

I'm sorry, but I do not fully understand the issue, as it seems to me that two problems are intermingled here. Please correct me where I am wrong.

  1. The content seems to be in base64. This is super bad and should be fixed. You can decode it using an ingest processor with a Painless script. See this example:
POST /_ingest/pipeline/_simulate
{
  "pipeline" :
  {
    "description": "_description",
    "processors": [
      {
        "script": {
          "lang": "painless",
          "source": "ctx.decoded = ctx.field.decodeBase64()"
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "field" : "TGVodGVsw6Q="
      }
    }
  ]
}
  2. ASCII folding is not working as expected:
GET _analyze
{
  "text": "Lehtelä",
  "filter": [ { "type" : "asciifolding", "preserve_original": "true"}]
}

{
  "tokens" : [
    {
      "token" : "Lehtela",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "Lehtelä",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "word",
      "position" : 0
    }
  ]
}

That looks as expected to me.
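
To make the two tokens above easier to reason about, here is a rough Python sketch of what folding with preserve_original does to a single token. It is only an approximation built on Unicode decomposition, not the actual asciifolding implementation (which handles many more characters), and the function names are made up for illustration.

```python
import unicodedata
from typing import List


def fold_to_ascii(token: str) -> str:
    """Approximate ASCII folding: decompose, then drop combining marks."""
    # NFKD splits "ä" into "a" followed by a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_preserving_original(token: str) -> List[str]:
    """Return the folded token, plus the original when the two differ."""
    folded = fold_to_ascii(token)
    return [folded] if folded == token else [folded, token]


if __name__ == "__main__":
    print(fold_preserving_original("Lehtel\u00e4"))  # ['Lehtela', 'Lehtelä']
```

Both forms end up at the same position, so a query for either Lehtela or Lehtelä can match the same document.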

Happy to see some more clarification on this. What you cannot do is just store the base64 value and expect everything to work magically. You need to do some conversion.
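
If the conversion is easier to do on the client before indexing, a minimal Python sketch could look like the following. It also shows why a query for vod can hit the encoded value: the lowercased base64 text happens to contain those letters. Whether that match really occurs depends on the analyzer and query type in use, which this thread does not show, so treat that part as an assumption. The helper name decode_base64_text is hypothetical.

```python
import base64


def decode_base64_text(value: str) -> str:
    """Decode a base64 string into UTF-8 text (hypothetical helper)."""
    # validate=True rejects input containing non-base64 characters.
    return base64.b64decode(value, validate=True).decode("utf-8")


if __name__ == "__main__":
    indexed = "TGVodGVsw6Q="
    # The encoded form contains "Vod", so a lowercased substring-style
    # match on "vod" can succeed even though the real value has no "vod".
    print("vod" in indexed.lower())     # True
    print(decode_base64_text(indexed))  # Lehtelä
```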

--Alex

Hi,
Thanks for your reply.
I was able to figure out the problem. I took the data for Elasticsearch from an LDAP server, and LDAP was actually encoding those non-ASCII values.
I think I should fix it on the LDAP side.
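
In case it helps anyone who lands here: assuming the data was exported as LDIF (this thread does not confirm that), non-ASCII values are written base64-encoded and marked with a double colon after the attribute name. A small Python sketch for decoding such lines before indexing might look like this. The function name is made up, and folded (continued) lines and URL values ("attr:< ...") are not handled.

```python
import base64
from typing import Tuple


def parse_ldif_line(line: str) -> Tuple[str, str]:
    """Split one unfolded LDIF line into (attribute, text value)."""
    attr, sep, rest = line.partition(":")
    if not sep:
        raise ValueError("not an attribute line: %r" % line)
    if rest.startswith(":"):
        # "attr:: value" means the value is base64-encoded.
        raw = rest[1:].strip()
        return attr, base64.b64decode(raw).decode("utf-8")
    # "attr: value" is a plain value.
    return attr, rest.strip()


if __name__ == "__main__":
    print(parse_ldif_line("sn:: TGVodGVsw6Q="))  # ('sn', 'Lehtelä')
    print(parse_ldif_line("cn: Alex"))           # ('cn', 'Alex')
```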