Getting Accented Text Indexed Properly


(Dadepo) #1

Following the article "You Have an Accent" (https://www.elastic.co/guide/en/elasticsearch/guide/current/asciifolding-token-filter.html), I added the following analysis configuration to my index:

PUT /blog
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "standard",
          "filter":  [ "lowercase", "asciifolding" ]
        }
      }
    }
  }
}

and, according to the article, testing the analyzer like this:

GET /my_index?analyzer=folding
My œsophagus caused a débâcle

should yield this:

my, oesophagus, caused, a, debacle

But this is not what I get. Instead I get the following output:

{
   "tokens": [
      {
         "token": "my",
         "start_offset": 0,
         "end_offset": 2,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "sophagus",
         "start_offset": 4,
         "end_offset": 12,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "caused",
         "start_offset": 13,
         "end_offset": 19,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "a",
         "start_offset": 20,
         "end_offset": 21,
         "type": "<ALPHANUM>",
         "position": 4
      },
      {
         "token": "d",
         "start_offset": 22,
         "end_offset": 23,
         "type": "<ALPHANUM>",
         "position": 5
      },
      {
         "token": "b",
         "start_offset": 24,
         "end_offset": 25,
         "type": "<ALPHANUM>",
         "position": 6
      },
      {
         "token": "cle",
         "start_offset": 26,
         "end_offset": 29,
         "type": "<ALPHANUM>",
         "position": 7
      }
   ]
}

Any idea why débâcle gets broken down the way it does on my machine?


(Jörg Prante) #2

asciifolding folds characters into their ASCII equivalents only.

For full Unicode support, you should use the ICU tokenizer/token filter:

https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu.html
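Once the analysis-icu plugin is installed, an ICU-based folding analyzer can be configured along these lines (a sketch only; the analyzer name "icu_folded" is just an example, and note that icu_folding also lowercases, so no separate lowercase filter is needed):

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "icu_folded": {
          "tokenizer": "icu_tokenizer",
          "filter": [ "icu_folding" ]
        }
      }
    }
  }
}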


(Dadepo) #3

Hi @jprante, thanks for the suggestion. I'll check it out, but that still doesn't explain why I am getting a different result from the article I linked to.


(Jörg Prante) #4

It must be

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "standard",
          "filter":  [ "lowercase", "asciifolding" ]
        }
      }
    }
  }
}

POST /my_index/_analyze?analyzer=folding
My œsophagus caused a débâcle

Note that you must pass UTF-8 in the POST body.
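As an aside on the encoding point: the tokens d, b, cle from the original question are exactly what you would expect if the two-byte UTF-8 sequences for é and â are never decoded and only the runs of plain ASCII bytes survive tokenization. A small Python illustration of that byte pattern (hypothetical, just to show where the tokens come from):

```python
import re

s = "débâcle"
data = s.encode("utf-8")   # é and â each encode to two non-ASCII bytes
print(data)                # b'd\xc3\xa9b\xc3\xa2cle'

# If the body is not decoded as UTF-8 and the non-ASCII bytes are
# treated as token separators, only the ASCII runs remain:
runs = re.findall(rb"[ -~]+", data)
print(runs)                # [b'd', b'b', b'cle'] -- the tokens seen above
```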

But for processing all kinds of Unicode characters, and for correct folding, you should use ICU folding in the tokenizer/token filter.

