I have created a custom keyword analyzer that uses the ICU Folding token filter. While using this analyzer, I noticed that the following subscript letters are not supported by ICU Folding:
- ᵢ [LATIN SUBSCRIPT SMALL LETTER I]
- ᵣ [LATIN SUBSCRIPT SMALL LETTER R]
- ᵤ [LATIN SUBSCRIPT SMALL LETTER U]
- ᵥ [LATIN SUBSCRIPT SMALL LETTER V]
The ICU Folding filter returns an empty string for these letters. I have installed the analysis-icu plugin as instructed here:
https://www.elastic.co/guide/en/elasticsearch/plugins/6.6/analysis-icu.html
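The analyzer definition itself is not included in this post. As a minimal sketch only, assuming the analyzer combines the built-in keyword tokenizer with nothing but the icu_folding filter, index settings along these lines (sent as the body of PUT /<my_index>) would produce an analyzer of the kind described. The real my_custom_analyzer may contain additional filters.

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["icu_folding"]
        }
      }
    }
  }
}
```

The keyword tokenizer emits the whole input as a single token, which is consistent with the single token shown in the response further down.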
Example:
POST /<my_index>/_analyze?pretty
{ "analyzer": "my_custom_analyzer", "text": "Kėdaᵢnių" }
Response:
{"tokens" : [{"token" : "kedaniu", "start_offset" : 0, "end_offset" : 8, "type" : "word", "position" : 0 } ] }
As we can see above, the resulting token is kedaniu, whereas it should have been kedainiu.
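To rule out any other part of the analyzer, the filter can also be exercised on its own. The following is a sketch of a request body for POST /_analyze?pretty that names the tokenizer and filter inline, so no index or custom analyzer is involved. The choice of the keyword tokenizer here is an assumption that mirrors the setup described above.

```json
{
  "tokenizer": "keyword",
  "filter": ["icu_folding"],
  "text": "Kėdaᵢnių"
}
```

If the subscript letter is still missing from the returned token, the behaviour comes from icu_folding itself rather than from the rest of the analyzer chain.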
For version information, these are the files in the ES plugin directory (elasticsearch-6.6.2\plugins\analysis-icu):
- analysis-icu-client-6.6.2.jar
- icu4j-62.1.jar
- LICENSE.txt
- lucene-analyzers-icu-7.6.0.jar
- NOTICE.txt
- plugin-descriptor.properties
Any thoughts/suggestions on this?
A few more characters are also not supported and come back as an empty string in the response:
- ‸ [CARET]
- ＾ [FULLWIDTH CIRCUMFLEX ACCENT]
- ″ [DOUBLE PRIME]
- ‶ [REVERSED DOUBLE PRIME]