ICU Folding for Latin Subscript Letters

tiwari.piyush7 · June 18, 2019, 11:45am

I have created a custom keyword analyzer that includes the ICU Folding token filter. While using this analyzer, I noticed that the following subscript letters are not supported by ICU Folding:

ᵢ [LATIN SUBSCRIPT SMALL LETTER I]
ᵣ [LATIN SUBSCRIPT SMALL LETTER R]
ᵤ [LATIN SUBSCRIPT SMALL LETTER U]
ᵥ [LATIN SUBSCRIPT SMALL LETTER V]

The ICU Folding filter returns an empty string for these letters. I installed the analysis-icu plugin as instructed here:
https://www.elastic.co/guide/en/elasticsearch/plugins/6.6/analysis-icu.html

Example:
POST /<my_index>/_analyze?pretty
{ "analyzer": "my_custom_analyzer", "text": "Kėdaᵢnių" }

Response:
{
  "tokens" : [
    {
      "token" : "kedaniu",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "word",
      "position" : 0
    }
  ]
}

As the response shows, the resulting token is kedaniu, but it should have been kedainiu.
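For comparison, here is a minimal Python sketch of the output I would expect. It is not the ICU Folding filter and does not reproduce its full rule set. It only uses the standard library's `unicodedata` module, on the assumption that compatibility decomposition (NFKD), removal of combining marks, and lowercasing are a reasonable approximation for this input. The helper name `fold_sketch` is made up for illustration.

```python
import unicodedata


def fold_sketch(text: str) -> str:
    """Rough stand-in for folding; NOT the ICU Folding filter itself."""
    # NFKD applies compatibility decomposition: subscript letters such as
    # U+1D62 map to their base letters, and accented letters split into
    # a base letter followed by a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks (general category Mn), then lowercase.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return stripped.lower()


# Show the code point, Unicode name, and folded form of each subscript letter.
for ch in "ᵢᵣᵤᵥ":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {fold_sketch(ch)!r}")

print(fold_sketch("Kėdaᵢnių"))  # kedainiu
```

With this approximation the four subscript letters map to i, r, u and v rather than being dropped, which is the behaviour I expected from the filter.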

For version information, these are the files in the ES plugin directory (elasticsearch-6.6.2\plugins\analysis-icu):

analysis-icu-client-6.6.2.jar
icu4j-62.1.jar
LICENSE.txt
lucene-analyzers-icu-7.6.0.jar
NOTICE.txt
plugin-descriptor.properties

Any thoughts/suggestions on this?

A few more characters are also not supported and produce an empty string in the response:

‸ [CARET]
＾ [FULLWIDTH CIRCUMFLEX ACCENT]
″ [DOUBLE PRIME]
‶ [REVERSED DOUBLE PRIME]

system · July 16, 2019, 11:45am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
