Issue with asciiFolding filter and accents


#1

Hi,

I have an issue with how the asciiFolding filter works.

Let me explain.

I have this analyzer:

"folding": {
    "tokenizer": "standard",
    "filter": ["asciifolding"]
}

I expected the tokens for the French words pate and pâte to be identical: pate, without the accent.

But they are not:

GET /cac/_analyze?analyzer=folding&text=pate
{
   "tokens": [
      {
         "token": "pate",
         "start_offset": 0,
         "end_offset": 4,
         "type": "<ALPHANUM>",
         "position": 1
      }
   ]
}

And:

GET /cac/_analyze?analyzer=folding&text=pâte
{
   "tokens": [
      {
         "token": "p",
         "start_offset": 0,
         "end_offset": 1,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "te",
         "start_offset": 2,
         "end_offset": 4,
         "type": "<ALPHANUM>",
         "position": 2
      }
   ]
}

Why do I get two tokens for the second word, the one with the accent? I have searched a lot but found nothing, and all my tests fail!

Thank you for your help!
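
For reference, the expected behaviour can be approximated outside Elasticsearch. The sketch below is not Lucene's ASCII folding implementation; it assumes that Unicode NFD decomposition followed by dropping combining marks is a close enough stand-in for French accents, and `fold` is a hypothetical helper name used only for illustration.

```python
import unicodedata


def fold(text: str) -> str:
    """Approximate ASCII folding: decompose, then drop combining marks."""
    # NFD splits "â" into "a" followed by a combining circumflex (U+0302).
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the base characters; combining marks have a non-zero class.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    # Both inputs should fold to the same unaccented token.
    print(fold("pate"), fold("p\u00e2te"))  # "p\u00e2te" is "pâte"
```

If the analyzer receives the text intact, both words should likewise end up as the single token pate.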


(David Pilato) #2

Well, be careful with the tool you use to send those requests.

The request must be sent as UTF-8; otherwise the standard analyzer might produce bad results.

For example, on my French laptop with curl, I get:

curl -XGET "http://localhost:9200/_analyze?tokenizer=standard&text=pâte&pretty"
{
  "tokens" : [ {
    "token" : "pᅢᄁte",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}
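
Both kinds of bad output can be reproduced without Elasticsearch. This is a minimal sketch, assuming the client and the server disagree about the charset of the request; it does not prove that this is what happened in the first post, but the character counts line up with the offsets in both outputs above.

```python
from urllib.parse import quote

word = "p\u00e2te"  # "pâte"

# The percent-encoded query string depends on the charset the client uses.
utf8_query = quote(word, encoding="utf-8")      # p%C3%A2te
latin1_query = quote(word, encoding="latin-1")  # p%E2te

# Case 1: Latin-1 bytes read as UTF-8. The lone byte 0xE2 is invalid and
# becomes U+FFFD, which a tokenizer would likely treat as a word boundary,
# leaving "p" (offsets 0-1) and "te" (offsets 2-4).
latin1_as_utf8 = word.encode("latin-1").decode("utf-8", errors="replace")

# Case 2: UTF-8 bytes read as Latin-1. "â" turns into two characters,
# so the single token is 5 characters long instead of 4.
utf8_as_latin1 = word.encode("utf-8").decode("latin-1")

if __name__ == "__main__":
    print(utf8_query, latin1_query)
    print(repr(latin1_as_utf8), len(latin1_as_utf8))
    print(repr(utf8_as_latin1), len(utf8_as_latin1))
```

Comparing the raw query string that actually leaves the client against `utf8_query` is a quick way to check which case applies.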

#3

Yes, no worries, the tools I use send UTF-8.

