ICU Folding for Latin Subscript Letters

I have created a custom keyword analyzer that uses the ICU Folding token filter. While using this analyzer, I noticed that the following subscript letters are not handled correctly by ICU Folding:

  1. ᵢ [LATIN SUBSCRIPT SMALL LETTER I]
  2. ᵣ [LATIN SUBSCRIPT SMALL LETTER R]
  3. ᵤ [LATIN SUBSCRIPT SMALL LETTER U]
  4. ᵥ [LATIN SUBSCRIPT SMALL LETTER V]

The ICU Folding filter returns an empty string for each of these letters. I installed the analysis-icu plugin as instructed here:
https://www.elastic.co/guide/en/elasticsearch/plugins/6.6/analysis-icu.html

Example:
POST /<my_index>/_analyze?pretty
{ "analyzer": "my_custom_analyzer", "text": "Kėdaᵢnių" }

Response:
{"tokens" : [{"token" : "kedaniu", "start_offset" : 0, "end_offset" : 8, "type" : "word", "position" : 0 } ] }

As the response shows, the token is kedaniu, but it should have been kedainiu: the subscript ᵢ was dropped instead of being folded to i.
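To make the expected result concrete, here is a minimal sketch that approximates the folding with Python's standard library. It is not the ICU Folding filter and does not reproduce its full rule set. It only assumes that compatibility decomposition (NFKD), followed by dropping combining marks and case folding, is a fair stand-in for this one input. The function name approximate_fold is made up for illustration.

```python
import unicodedata


def approximate_fold(text: str) -> str:
    """Rough stand-in for accent/compatibility folding (NOT the ICU filter itself)."""
    # NFKD splits accented letters into base letter + combining marks and
    # replaces compatibility characters (such as subscripts) with base letters.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (general category Mn) left by the decomposition.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return stripped.casefold()


# Escapes keep the sample unambiguous: \u0117 = ė, \u1d62 = ᵢ, \u0173 = ų
sample = "K\u0117da\u1d62ni\u0173"
print(approximate_fold(sample))  # kedainiu
```

Under these assumptions the subscript ᵢ becomes a plain i rather than disappearing, which is the behaviour I expected from the analyzer.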

For version information, these are the files in the ES plugin directory (elasticsearch-6.6.2\plugins\analysis-icu):

  • analysis-icu-client-6.6.2.jar
  • icu4j-62.1.jar
  • LICENSE.txt
  • lucene-analyzers-icu-7.6.0.jar
  • NOTICE.txt
  • plugin-descriptor.properties

Any thoughts/suggestions on this?

A few more characters are not supported either; for each of them the response contains an empty string:

  1. ‸ [CARET]
  2. ＾ [FULLWIDTH CIRCUMFLEX ACCENT]
  3. ″ [DOUBLE PRIME]
  4. ‶ [REVERSED DOUBLE PRIME]
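Because several of these characters are easy to confuse with look-alikes, the sketch below prints the code point, Unicode name, and compatibility decomposition of every character reported in this thread. It uses only Python's unicodedata module. The CHARS list and the describe helper are names I made up for illustration, and the code points are written as escapes on the assumption that they match the names listed above.

```python
import unicodedata

# Characters reported in this thread, written as escapes to avoid look-alikes.
CHARS = [
    "\u1d62", "\u1d63", "\u1d64", "\u1d65",  # Latin subscript small letters i, r, u, v
    "\u2038", "\uff3e", "\u2033", "\u2036",  # caret, fullwidth circumflex, (reversed) double prime
]


def describe(ch: str) -> tuple:
    """Return (code point, Unicode name, compatibility decomposition or '(none)')."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.decomposition(ch) or "(none)",
    )


for ch in CHARS:
    print(*describe(ch), sep="  ")
```

Including the code points from this output in a bug report should make it easier for others to reproduce the behaviour exactly.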
