Elasticsearch filter comparison with "preserve_original": true

nkachami · March 29, 2024, 5:30pm

Hello,

I am testing the outputs of different analyzer configurations and found one that does not seem to make much sense.

Let's imagine we have two settings blocks:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "standard",
          "filter": ["lowercase", "ascii_folding_original_preserving"]
        }
      },
      "filter": {
        "ascii_folding_original_preserving": {
          "type": "asciifolding",
          "preserve_original": true
        }
      }
    }
  }
}

and

{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "icu_tokenizer",
          "filter": ["lowercase", "icu_folding_original_preserving"]
        }
      },
      "filter": {
        "icu_folding_original_preserving": {
          "type": "icu_folding",
          "preserve_original": true
        }
      }
    }
  }
}

Let's assume we run both of these analyze API requests on text made of full-width characters:

GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "ascii_folding_original_preserving"],
  "text":"Ｃｕｌｔｕｒｅ ｏｆ Ｊａｐａｎ"
}

GET /_analyze
{
  "tokenizer": "icu_tokenizer",
  "filter": ["lowercase", "icu_folding_original_preserving"],
  "text":"Ｃｕｌｔｕｒｅ ｏｆ Ｊａｐａｎ"
}

According to the documentation, the icu_folding filter provides "Case folding of Unicode characters based on UTR#30, like the ASCII-folding token filter on steroids."

Given that, I expected both requests to return exactly the same response, namely the 1st analysis result below, which I consider the correct one. Instead I received:

1st analysis result with ascii_folding_original_preserving filter:

{
  "tokens" : [
    {
      "token" : "culture",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "ｃｕｌｔｕｒｅ",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "of",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "ｏｆ",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "japan",
      "start_offset" : 11,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "ｊａｐａｎ",
      "start_offset" : 11,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}

2nd analysis result with icu_folding_original_preserving filter:

{
  "tokens" : [
    {
      "token" : "culture",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "of",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "japan",
      "start_offset" : 11,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}
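
To make the difference between the two results explicit, here is a minimal Python sketch that diffs them by position. The helper names (tokens_by_position, missing_tokens) are illustrative and not part of any Elasticsearch client, and the two dicts are abbreviated copies of the responses above that keep only the "token" and "position" fields.

```python
from collections import defaultdict


def tokens_by_position(response):
    """Group the token strings of an _analyze response by their position."""
    grouped = defaultdict(set)
    for tok in response["tokens"]:
        grouped[tok["position"]].add(tok["token"])
    return dict(grouped)


def missing_tokens(expected, actual):
    """Return {position: [tokens]} found in `expected` but absent from `actual`."""
    exp = tokens_by_position(expected)
    act = tokens_by_position(actual)
    diff = {}
    for pos, toks in exp.items():
        lost = toks - act.get(pos, set())
        if lost:
            diff[pos] = sorted(lost)
    return diff


# Abbreviated copy of the 1st result (asciifolding, preserve_original: true).
asciifolding_result = {"tokens": [
    {"token": "culture", "position": 0},
    {"token": "ｃｕｌｔｕｒｅ", "position": 0},
    {"token": "of", "position": 1},
    {"token": "ｏｆ", "position": 1},
    {"token": "japan", "position": 2},
    {"token": "ｊａｐａｎ", "position": 2},
]}

# Abbreviated copy of the 2nd result (icu_folding with the same setting).
icu_result = {"tokens": [
    {"token": "culture", "position": 0},
    {"token": "of", "position": 1},
    {"token": "japan", "position": 2},
]}

# Tokens the 1st result has that the 2nd result lacks, keyed by position.
print(missing_tokens(asciifolding_result, icu_result))
```

The diff contains only the full-width originals, one per position, so the folded tokens match in both results and the only thing missing from the icu_folding output is the preserved original.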
