Analysis-icu plugin usage at client side

Hi,

I would like to use the analysis-icu plugin (https://www.elastic.co/guide/en/elasticsearch/plugins/6.6/analysis-icu.html) for ICU folding with Elasticsearch 6.6.2. I was able to install the analysis-icu plugin successfully on my Elasticsearch server.

After installation, the ES plugin directory (elasticsearch-6.6.2\plugins\analysis-icu) contains the following files, which show the version information:

  • analysis-icu-client-6.6.2.jar
  • icu4j-62.1.jar
  • LICENSE.txt
  • lucene-analyzers-icu-7.6.0.jar
  • NOTICE.txt
  • plugin-descriptor.properties

This is just the first half of the story. The second half is to apply the SAME analysis-icu plugin on the client side, to generate the tokens for a given input text before searching. To do that, I included the following Maven dependency (https://mvnrepository.com/artifact/org.elasticsearch.plugin/analysis-icu) in my Java project:

<dependency>
    <groupId>org.elasticsearch.plugin</groupId>
    <artifactId>analysis-icu</artifactId>
    <version>5.0.0-alpha5</version>
</dependency>

However, this analysis-icu Maven dependency uses a different version of the lucene-analyzers-icu jar, 6.1.0 (lucene-analyzers-icu-6.1.0.jar), which is not aligned with what is installed on the server, 7.6.0 (lucene-analyzers-icu-7.6.0.jar). Here is the relevant part of the analysis-icu plugin's pom.xml:

<dependency>
   <groupId>org.apache.lucene</groupId>
   <artifactId>lucene-analyzers-icu</artifactId>
   <version>6.1.0</version>
   <scope>compile</scope>
   <exclusions>
      <exclusion>
         <groupId>org.apache.lucene</groupId>
         <artifactId>lucene-analyzers-common</artifactId>
      </exclusion>
      <exclusion>
         <groupId>org.apache.lucene</groupId>
         <artifactId>lucene-core</artifactId>
      </exclusion>
      <exclusion>
         <groupId>com.ibm.icu</groupId>
         <artifactId>icu4j</artifactId>
      </exclusion>
   </exclusions>
</dependency>

How do we ensure that the analysis-icu plugin version used at client side is in sync with what is installed on server?


In one case you are using 6.6.2 but then declare 5.0.0-alpha5. That said I don't see why you think you need to add this lib to your application. Any reason?

I think you might have missed this link (ICU Analysis Plugin | Elasticsearch Plugins and Integrations [6.6] | Elastic) for installing the analysis-icu plugin on the Elasticsearch server.

It says to use the following command to install the plugin:

sudo bin/elasticsearch-plugin install analysis-icu

However, to use the same plugin on the client side I had to use Maven to pull in the plugin dependency, which brings a different version of Lucene (6.1.0). I didn't find any version of this plugin that bundles Lucene 7.6.0.

In one case you are using 6.6.2 but then declare 5.0.0-alpha5.

Because Maven does not have 6.6.2 or any other compatible version of the plugin.

Why do I need to add this library to my application?

Because some fields in our ES documents are indexed with the icu_folding token filter applied. To match/locate documents by those fields, we need to apply the same ICU folding token filter on the client side as well.

Example:

  • Actual data to be indexed = Nicolás
  • What we store after icu_folding is applied = nicolas
  • Input request = Nicolás
  • What we would like to search for to match to above document = nicolas

But the plugin runs server side not client side so I don't understand why you would need it in your application.

@dadoonet, I think I may have understood his problem.

Let's consider the indexed string "Nicolás", whose token in ES is "nicolas".

A. Now we fire a term query for "Nicolás". It won't match, because the input "Nicolás" does not equal the token "nicolas". The ICU folding filter has no preserve_original option like the ASCII folding filter does, and the input is not converted to "nicolas" when the query is built.

B. Another angle:
Now let's say we fire a match query for "Nicolás". It will match the token "nicolas", because a match query runs the field's analyzer on the input and converts it to "nicolas" on the query side. Hence converted input "nicolas" = token "nicolas". Using match could degrade performance (correct me if I'm wrong here).

In case A, even if the original were preserved somehow, piyush might still want to fire a term query, which is why he is using the ICU analyzer on the input side to convert "Nicolás" to "nicolas".
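Cases A and B can be sketched in plain Java. This is only a toy model under stated assumptions: the fold method is a hypothetical stand-in built on java.text.Normalizer, not the real icu_folding filter, and the Set stands in for the tokens stored in the index. All class and method names are made up for the illustration.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.Set;

public class TermVsMatchSketch {

    // Rough stand-in for icu_folding (assumption): decompose, strip combining
    // marks, lowercase. It only covers simple accented Latin input.
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    // Case A: a term query compares the raw input with the stored tokens.
    static boolean termQuery(Set<String> indexedTokens, String input) {
        return indexedTokens.contains(input);
    }

    // Case B: a match query analyzes the input before comparing.
    static boolean matchQuery(Set<String> indexedTokens, String input) {
        return indexedTokens.contains(fold(input));
    }

    public static void main(String[] args) {
        String name = "Nicol\u00e1s"; // "Nicolás", escaped to avoid source-encoding issues
        Set<String> indexedTokens = Set.of(fold(name)); // index-time analysis stores "nicolas"

        System.out.println(termQuery(indexedTokens, name));       // false: raw input, no analysis
        System.out.println(matchQuery(indexedTokens, name));      // true: analyzed at query time
        System.out.println(termQuery(indexedTokens, fold(name))); // true: folded on the client first
    }
}
```

The last line is the situation discussed in this thread: a term query only works if the client folds the input the same way the server did at index time.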



I can think of a few more cases where the indexed data is "Nicolás" and the generated token is "nicolas".
Now suppose the input is "Nicólas". This would never match if a term query is fired.

Hence the question may come from having to convert the input manually on the client side. Also, say we still want to stay in sync with the ES version we are using on the server. How do we ensure we use the same ICU analyzer version on the client side? The purpose may not be querying but building some cleanser table.

I might be wrong but the analyzer you are using at index time is also by default applied at search time.
So I don't understand what the problem is. But I might be wrong.

The index-time analyzer is applied at search time only if the query type is "match".
For a term query it is not applied. This is our current understanding after profiling in Kibana.

And that's the case when we have to apply same analyzer (ICU) on search (client) side too.

So use a match query instead of a term query.

That is what we thought initially.
But we have an algorithm that requires a term query on a not-analyzed field, and that field has some significance for us.

We also have another requirement: we want the ICU plugin that we use in our client code to be in sync with the plugin installed in ES, i.e. the same version, say 6.6.1.
We are using this es-icu-plugin library as an external library, not for querying but for applying icu_folding to text and storing the result in a cache.

Then you don't need the Elasticsearch ICU plugin, but an ICU jar that does whatever transformation you need. Maybe that's lucene-analyzers-icu, or icu4j, or some other lib.

But you don't need this elasticsearch plugin on the client side IMO.

Hi,

Have you looked at the "_analyze" API? It can do what you want: you ask Elasticsearch for the folded version of your input ("nicolas" in your example), which saves you from installing anything on the client side.

https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html
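A minimal sketch of what such a request could look like from Java, assuming a node at http://localhost:9200 (a placeholder) and Java 11+ for java.net.http. The keyword tokenizer is an assumption carried over from the use case of folding a whole value as a single token. The class and method names are hypothetical, and the request is only built here, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class AnalyzeRequestSketch {

    // Minimal JSON string escaping for this illustration (backslashes and quotes only).
    static String jsonEscape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Body for the _analyze API: keyword tokenizer plus the icu_folding filter.
    static String analyzeBody(String text) {
        return "{\"tokenizer\":\"keyword\",\"filter\":[\"icu_folding\"],\"text\":\""
                + jsonEscape(text) + "\"}";
    }

    // Builds (but does not send) the POST request against <baseUrl>/_analyze.
    static HttpRequest analyzeRequest(String baseUrl, String text) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_analyze"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(analyzeBody(text)))
                .build();
    }

    public static void main(String[] args) {
        String name = "Nicol\u00e1s"; // "Nicolás"
        HttpRequest request = analyzeRequest("http://localhost:9200", name);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(analyzeBody(name));
    }
}
```

The response lists the tokens produced by the server's own plugin, so the result always reflects the installed version, at the cost of one network round trip per call.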

Yes, absolutely. We were going after the ES plugin jar because it internally wraps the lucene-analyzers-icu jar.

We have a centralized property where we specify the ES version. Our idea was to use the same property for an ICU module of ES (if there is one), so that we can always be sure the Lucene jar we use is the same version on the server and the client side.

If we use a Lucene jar directly, we will need to maintain its version on our end. Whenever we upgrade ES, we will lose the version sync between our Lucene jar and the Lucene jar that ES is using.

Let's revisit the original question:
Is there a way to ensure that the Lucene jar version we use and the Lucene jar version that ES uses always stay in sync?

Our temporary solution:
We are using the dependency below and creating an analyzer with the ICU folding filter in our code, then getting a TokenStream out of it.

<dependency>
    <groupId>org.elasticsearch.plugin</groupId>
    <artifactId>analysis-icu</artifactId>
    <version>5.0.0-alpha5</version>
</dependency>

The code:

// Single tokenizer plus the ICU folding filter
analyzer = CustomAnalyzer.builder()
        .withTokenizer(KEYWORD)
        .addTokenFilter(ICUFoldingFilterFactory.class)
        .build();

// Run the input text through the analyzer to get the folded token(s)
final TokenStream stream = analyzer.tokenStream(null, new StringReader(inputText));

Problem here:
We have to maintain version=5.0.0-alpha5

Thanks Gabriel. We did check the _analyze API as well as the termVectors API. Both are REST calls, and we cannot afford the network latency in our performance-sensitive application.
We get thousands of requests, and an extra REST call for each of them would hurt throughput.

Hence we decided to do the ICU folding on the client side.

The thing is that we no longer publish a pom.xml for plugins that should not be embedded.
That's why you don't see recent versions of it.
I understand that this reduces the "complexity" of maintenance for you.

Another way would be to "just" read this file (https://github.com/elastic/elasticsearch/blob/master/buildSrc/version.properties) anytime you upgrade elasticsearch and add the right Lucene version you need. That should be easy.
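As a rough sketch of that lookup, the Lucene version can be read with java.util.Properties. The excerpt below is a hypothetical two-line sample, and the key name "lucene" is an assumption about the file's layout; in practice the text would come from the version.properties of the tag matching your server version.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class LuceneVersionLookup {

    // Returns the value of the "lucene" key from version.properties-style text.
    static String luceneVersion(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String version = props.getProperty("lucene");
        if (version == null) {
            throw new IllegalArgumentException("no 'lucene' entry found");
        }
        return version.trim();
    }

    public static void main(String[] args) {
        // Hypothetical excerpt; the versions are the ones mentioned in this thread.
        String excerpt = "elasticsearch = 6.6.2\nlucene = 7.6.0\n";
        System.out.println(luceneVersion(excerpt)); // prints 7.6.0
    }
}
```

The value found this way is what would go into the version of a direct lucene-analyzers-icu dependency, next to the centralized ES version property.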
