Getting Accented Text Indexed Properly


(Dadepo) #1

Following the article "You Have an Accent" (https://www.elastic.co/guide/en/elasticsearch/guide/current/asciifolding-token-filter.html), I added the following analysis configuration to my index:

PUT /blog
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "standard",
          "filter":  [ "lowercase", "asciifolding" ]
        }
      }
    }
  }
}

and, according to the article, testing the analyzer like this:

GET /my_index?analyzer=folding
My œsophagus caused a débâcle

should yield this:

my, oesophagus, caused, a, debacle

But this is not what I get. Instead I get the following output:

{
   "tokens": [
      {
         "token": "my",
         "start_offset": 0,
         "end_offset": 2,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "sophagus",
         "start_offset": 4,
         "end_offset": 12,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "caused",
         "start_offset": 13,
         "end_offset": 19,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "a",
         "start_offset": 20,
         "end_offset": 21,
         "type": "<ALPHANUM>",
         "position": 4
      },
      {
         "token": "d",
         "start_offset": 22,
         "end_offset": 23,
         "type": "<ALPHANUM>",
         "position": 5
      },
      {
         "token": "b",
         "start_offset": 24,
         "end_offset": 25,
         "type": "<ALPHANUM>",
         "position": 6
      },
      {
         "token": "cle",
         "start_offset": 26,
         "end_offset": 29,
         "type": "<ALPHANUM>",
         "position": 7
      }
   ]
}

Any idea why débâcle gets broken down the way it does on my machine?


(Jörg Prante) #2

asciifolding folds characters into their ASCII equivalents only.

For full Unicode support, you should use the ICU tokenizer/token filter:

https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu.html
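Once the analysis-icu plugin is installed, an ICU-based folding analyzer can be configured along these lines (a sketch only; the analyzer name "icu_folded" is just an example, and note that icu_folding also lowercases, so no separate lowercase filter is needed):

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "icu_folded": {
          "tokenizer": "icu_tokenizer",
          "filter": [ "icu_folding" ]
        }
      }
    }
  }
}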


(Dadepo) #3

Hi @jprante, thanks for the suggestion. I'll check it out, but that still doesn't explain why I am getting a different result from the article I linked to.


(Jörg Prante) #4

It must be

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "standard",
          "filter":  [ "lowercase", "asciifolding" ]
        }
      }
    }
  }
}

POST /my_index/_analyze?analyzer=folding
My œsophagus caused a débâcle

Note that you must pass UTF-8 in the POST body.
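As an aside on the encoding point: the tokens d, b, cle from the original question are exactly what you would expect if the two-byte UTF-8 sequences for é and â are never decoded and only the runs of plain ASCII bytes survive tokenization. A small Python illustration of that byte pattern (hypothetical, just to show where the tokens come from):

```python
import re

s = "débâcle"
data = s.encode("utf-8")   # é and â each encode to two non-ASCII bytes
print(data)                # b'd\xc3\xa9b\xc3\xa2cle'

# If the body is not decoded as UTF-8 and the non-ASCII bytes are
# treated as token separators, only the ASCII runs remain:
runs = re.findall(rb"[ -~]+", data)
print(runs)                # [b'd', b'b', b'cle'] -- the tokens seen above
```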

But for processing all kinds of Unicode characters, and for correct folding, you should use ICU folding in the tokenizer/token filter.

