Elasticsearch filter comparison with "preserve_original": true

Hello,

I am testing the outputs of different analyzer configurations and found one result that does not seem to make much sense.

Let's imagine we have two settings blocks:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "standard",
          "filter": ["lowercase", "ascii_folding_original_preserving"]
          }
      },
      "filter": {
        "ascii_folding_original_preserving": {
          "type": "asciifolding",
          "preserve_original": true
        }
      }
    }
  }
}

and

{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "icu_tokenizer",
          "filter": ["lowercase", "icu_folding_original_preserving"]
          }
      },
      "filter": {
        "icu_folding_original_preserving": {
          "type": "icu_folding",
          "preserve_original": true
        }
      }
    }
  }
}

Let's assume we run both of these analyze API requests on text written in full-width characters:

GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "ascii_folding_original_preserving"],
  "text":"Ｃｕｌｔｕｒｅ ｏｆ Ｊａｐａｎ"
}

GET /_analyze
{
  "tokenizer": "icu_tokenizer",
  "filter": ["lowercase", "icu_folding_original_preserving"],
  "text":"Ｃｕｌｔｕｒｅ ｏｆ Ｊａｐａｎ"
}

According to the documentation, the icu_folding filter provides "Case folding of Unicode characters based on UTR#30, like the ASCII-folding token filter on steroids."
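
As a rough illustration of what I mean by folding full-width characters, here is a small Python sketch. It does not use the ICU plugin or Elasticsearch at all: it assumes that Unicode NFKC normalization plus lowercasing is a fair stand-in for the width and case folding part only, and the helper names to_fullwidth and fold_width_and_case are my own.

```python
import unicodedata


def to_fullwidth(text):
    """Map printable ASCII (U+0021..U+007E) to its full-width form (U+FF01..U+FF5E)."""
    return "".join(chr(ord(c) + 0xFEE0) if "!" <= c <= "~" else c for c in text)


def fold_width_and_case(text):
    """Approximate width and case folding: NFKC maps full-width forms to ASCII, then lowercase."""
    return unicodedata.normalize("NFKC", text).lower()


# Build a full-width sample without typing the characters by hand.
sample = to_fullwidth("Culture of Japan")
folded = fold_width_and_case(sample)

print(sample)
print(folded)
```

With preserve_original set to true, I would expect each position to carry both the folded form and the original full-width form.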

Given that description, I expected both requests to return exactly the same response, matching the 1st analysis result below. Instead I received:

1st analysis result with ascii_folding_original_preserving filter:

{
  "tokens" : [
    {
      "token" : "culture",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "culture",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "of",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "of",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "japan",
      "start_offset" : 11,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "japan",
      "start_offset" : 11,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}

2nd analysis result with icu_folding_original_preserving filter:

{
  "tokens" : [
    {
      "token" : "culture",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "of",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "japan",
      "start_offset" : 11,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}
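
To make the difference concrete, here is a minimal Python sketch that groups the tokens of each response by position and reports where the counts differ. The function names are hypothetical, and the two dictionaries are abbreviated copies of the responses above (only the token and position fields are kept).

```python
from collections import defaultdict


def tokens_by_position(response):
    """Group token strings from an _analyze response by their position."""
    grouped = defaultdict(list)
    for tok in response["tokens"]:
        grouped[tok["position"]].append(tok["token"])
    return dict(grouped)


def compare_token_counts(first, second):
    """Return {position: (count_in_first, count_in_second)} where the counts differ."""
    a, b = tokens_by_position(first), tokens_by_position(second)
    diffs = {}
    for pos in sorted(set(a) | set(b)):
        counts = (len(a.get(pos, [])), len(b.get(pos, [])))
        if counts[0] != counts[1]:
            diffs[pos] = counts
    return diffs


# Abbreviated copy of the 1st result: two tokens per position.
ascii_result = {"tokens": [
    {"token": "culture", "position": 0}, {"token": "culture", "position": 0},
    {"token": "of", "position": 1}, {"token": "of", "position": 1},
    {"token": "japan", "position": 2}, {"token": "japan", "position": 2},
]}

# Abbreviated copy of the 2nd result: one token per position.
icu_result = {"tokens": [
    {"token": "culture", "position": 0},
    {"token": "of", "position": 1},
    {"token": "japan", "position": 2},
]}

print(compare_token_counts(ascii_result, icu_result))
```

Every position has two tokens in the first response and only one in the second, which is the discrepancy I am asking about.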
