Hello,
I am testing the outputs of different analyzer configurations and found one that does not seem to make much sense.
Let's imagine we have two settings blocks:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "standard",
          "filter": ["lowercase", "ascii_folding_original_preserving"]
        }
      },
      "filter": {
        "ascii_folding_original_preserving": {
          "type": "asciifolding",
          "preserve_original": true
        }
      }
    }
  }
}
and
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "icu_tokenizer",
          "filter": ["lowercase", "icu_folding_original_preserving"]
        }
      },
      "filter": {
        "icu_folding_original_preserving": {
          "type": "icu_folding",
          "preserve_original": true
        }
      }
    }
  }
}
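For anyone reproducing this, the settings blocks would be applied at index-creation time, something along these lines for the first block (the index name test-asciifolding is just a placeholder; the ICU variant is created the same way from the second block and additionally requires the analysis-icu plugin to be installed):
PUT /test-asciifolding
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "standard",
          "filter": ["lowercase", "ascii_folding_original_preserving"]
        }
      },
      "filter": {
        "ascii_folding_original_preserving": {
          "type": "asciifolding",
          "preserve_original": true
        }
      }
    }
  }
}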
Let's assume we run both of these _analyze API requests with full-width characters as input:
GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "ascii_folding_original_preserving"],
  "text": "Ｃｕｌｔｕｒｅ ｏｆ Ｊａｐａｎ"
}
GET /_analyze
{
  "tokenizer": "icu_tokenizer",
  "filter": ["lowercase", "icu_folding_original_preserving"],
  "text": "Ｃｕｌｔｕｒｅ ｏｆ Ｊａｐａｎ"
}
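Since ascii_folding_original_preserving and icu_folding_original_preserving are defined in the index settings, I believe these requests would normally be run against such an index (e.g. GET /test-asciifolding/_analyze). Alternatively, the same filter chains can be spelled out inline in the _analyze body, without any index; a minimal sketch, assuming the analysis-icu plugin is installed for the second request:
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "asciifolding", "preserve_original": true }
  ],
  "text": "Ｃｕｌｔｕｒｅ ｏｆ Ｊａｐａｎ"
}

GET /_analyze
{
  "tokenizer": "icu_tokenizer",
  "filter": [
    "lowercase",
    { "type": "icu_folding", "preserve_original": true }
  ],
  "text": "Ｃｕｌｔｕｒｅ ｏｆ Ｊａｐａｎ"
}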
Based on the documentation, the icu_folding filter applies "Case folding of Unicode characters based on UTR#30, like the ASCII-folding token filter on steroids."
With that in mind, I expected both requests to return exactly the same response, with the correct one being the 1st analysis result below, but instead I received:
1st analysis result with the ascii_folding_original_preserving filter:
{
  "tokens" : [
    {
      "token" : "culture",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "ｃｕｌｔｕｒｅ",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "of",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "ｏｆ",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "japan",
      "start_offset" : 11,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "ｊａｐａｎ",
      "start_offset" : 11,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}
2nd analysis result with the icu_folding_original_preserving filter:
{
  "tokens" : [
    {
      "token" : "culture",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "of",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "japan",
      "start_offset" : 11,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}