unicodeSetFilter in analysis-icu ignored

Barsk · March 25, 2016, 11:24am

I upgraded to ES 2.2.1 from a very old 0.18 installation and have run into a problem.
I have two environments where I have configured an analysis-icu analyzer. In the first environment (Win 7) everything works perfectly, but in my other environment (Linux) the unicodeSetFilter parameter is ignored. The elasticsearch.yml file looks like this and is UTF-8 encoded:

index :
    analysis :
        analyzer : 
           swedishIcuFoldingAnalyzer :
               type : custom
               tokenizer : standard
               filter : [icuFolding, lowercase, swedish_stop]
   
        filter :
           swedish_stop :
                 type : stop
                 stopwords :  _swedish_ 
           icuFolding :
                type : icu_folding
                unicodeSetFilter : "[^åäöÅÄÖ]"

When I query for the text "modrar" it will match "mödrar" in the Linux environment while on Windows it will not - as expected.
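To make the expected behaviour concrete, here is a rough Python sketch of what the "[^åäöÅÄÖ]" set is meant to achieve: every character is folded except the six listed ones. This is only an approximation built on the standard unicodedata module. It is not ICU's actual folding (which also handles case and many other mappings), and the names fold and PROTECTED are made up for the illustration.

```python
import unicodedata

# Characters the set "[^åäöÅÄÖ]" excludes from folding (assumption: this
# mirrors the intent of the unicodeSetFilter in the config above).
PROTECTED = set("åäöÅÄÖ")

def fold(text, protected=PROTECTED):
    """Strip combining marks from every character except the protected ones.

    A simplified stand-in for ICU folding, not the real algorithm.
    """
    out = []
    # NFC first, so that 'o' + combining diaeresis is seen as a single 'ö'.
    for ch in unicodedata.normalize("NFC", text):
        if ch in protected:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)
```

With the protected set in place, fold("mödrar") stays "mödrar", so a query for "modrar" should not match. With an empty set, fold("mödrar", protected=set()) gives "modrar", which is the behaviour seen on the Linux machine.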

Any clues...?

jprante · March 25, 2016, 5:37pm

I don't think it makes sense to use the standard tokenizer with ICU. Instead, I recommend the icu_tokenizer.

Also, as a side note, you should always apply the stop word filter first, before lowercase or folding.

If the ICU plugin by Elastic really does not work, which would be very strange, I can offer an alternative implementation at https://github.com/jprante/elasticsearch-plugin-bundle/ where I just added an ICU folding filter test that succeeds like you have described https://github.com/jprante/elasticsearch-plugin-bundle/blob/403480349d5caf055e835c0dc3cf8bc798ec9359/src/test/java/org/xbib/elasticsearch/index/analysis/icu/IcuFoldingFilterTests.java

Barsk · March 26, 2016, 11:22am

You may be correct with the icu_tokenizer instead of standard, but it makes no difference here.
Also, the ordering of filters is a good point, but I believe lowercase should come before stopwords, right? I mean, all the stopwords are in lowercase.
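A toy Python model of the ordering question, assuming the stop filter compares tokens case-sensitively against a lowercase list. Whether the real stop filter does so depends on its settings, so treat this as an illustration only. The three-word stop list and the function names are hypothetical, not taken from the _swedish_ list or from any ES API.

```python
# Tiny hypothetical stop list; the real _swedish_ list is much longer.
STOPWORDS = {"och", "att", "det"}

def lowercase(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]

def stop(tokens, stopwords=STOPWORDS):
    """Drop tokens that match the stop list exactly (case-sensitive)."""
    return [t for t in tokens if t not in stopwords]

def run(tokens, filters):
    """Apply the filters in the given order, like a token filter chain."""
    for f in filters:
        tokens = f(tokens)
    return tokens
```

Under that assumption, run(["Och", "Mödrar"], [stop, lowercase]) keeps the capitalised "Och" and yields ["och", "mödrar"], while run(["Och", "Mödrar"], [lowercase, stop]) yields ["mödrar"].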

As for the analysis-icu plugin: it does work. The problem is that it is ignoring, or misinterpreting, the unicodeSetFilter parameter. This is strange.

Barsk · March 29, 2016, 11:29am

Well, it turned out to be a stupid error on my part. Apparently I had two instances of ES running on the same machine, and restarting the service had no effect since the other instance was blocking the reloads...

It seems ES just quietly picks the next unoccupied port number and will not even give a warning that the standard port is blocked.

jprante · March 30, 2016, 12:43pm

That's correct, this is the intended behavior. There is no "standard port" but a port range for each service (9200-9300 for HTTP, 9300-9400 for transport). It was meant to ease demonstrations with multicast: you can start as many ES processes as you wish on the same machine, or on the same network, and they will find each other and form a cluster. Now that multicast is gone, this behavior seems weird.
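A quick way to spot a second instance is to check which ports in the HTTP range accept connections. The following is a minimal Python sketch using only the standard socket module. The function name listening_ports and the defaults (localhost, 9200-9300, a 0.2 second timeout) are assumptions for the example, and an open port only shows that something is listening there, not that it is ES.

```python
import socket

def listening_ports(host="127.0.0.1", ports=range(9200, 9301), timeout=0.2):
    """Return the ports in `ports` that accept a TCP connection on `host`."""
    found = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            # connect_ex returns 0 on success instead of raising an exception.
            if s.connect_ex((host, port)) == 0:
                found.append(port)
    return found
```

If listening_ports() returns more than one port, for example both 9200 and 9201, a second node is probably running on the machine and may be the one answering on 9200.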
