Standard TokenizerFactory deprecated in Elasticsearch Java API v7?

munasia · March 12, 2020, 11:45pm

I'm currently upgrading Elasticsearch from v1.7 to v7.1. While refactoring our Java code, I came across this Elasticsearch v1.7 code:

    private static final TokenizerFactory STANDARD_TOKENIZER_FACTORY = 
        PreBuiltTokenizers.STANDARD.getTokenizerFactory(Version.V_1_7_3);

    org.apache.lucene.analysis.Tokenizer tokenizer = 
            STANDARD_TOKENIZER_FACTORY.create(new StringReader(terms.toLowerCase()));

Can this be recreated using the Elasticsearch 7.1 Java APIs? I've tried, to no avail. It looks like PreBuiltTokenizers.STANDARD.getTokenizerFactory() has been deprecated in v7.

spinscale · March 13, 2020, 4:43pm

Hey,

can you maybe explain what you are trying to do and why you need access to those classes? Are you writing a plugin like an AnalysisPlugin?

--Alex

munasia · March 13, 2020, 8:28pm

In our Java code, we tokenize the search terms and then sort the tokens into two lists: high-frequency tokens (such as stopwords) and low-frequency tokens. We then build a compound query and send it to Elasticsearch. In that query, the two groups get different settings (e.g. different boosts). This is legacy code.

For example, if the user searches for "the quick and fast brown fox", we'll break that up into ["the and"] and ["quick fast brown fox"], then create a query that boosts the low-frequency tokens differently from the high-frequency tokens, and send it to Elasticsearch.
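To make the categorization step concrete, here is a minimal sketch of the split using only the JDK. The class name `TermSplitter`, the `Split` holder, and the contents of the high-frequency set are all assumptions for illustration; the actual legacy code and its word list are not shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper: sorts already-tokenized search terms into two groups.
final class TermSplitter {

    // Assumed high-frequency list; a real one would come from configuration
    // or from term statistics.
    private static final Set<String> HIGH_FREQUENCY = Set.of("the", "and", "a", "of", "to");

    // Simple holder for the two resulting lists.
    static final class Split {
        final List<String> highFrequency = new ArrayList<>();
        final List<String> lowFrequency = new ArrayList<>();
    }

    private TermSplitter() {}

    // Keeps the original token order inside each group.
    static Split split(List<String> tokens) {
        Split result = new Split();
        for (String token : tokens) {
            if (HIGH_FREQUENCY.contains(token)) {
                result.highFrequency.add(token);
            } else {
                result.lowFrequency.add(token);
            }
        }
        return result;
    }
}
```

Each list can then be joined with spaces and passed to its own query clause, with its own boost.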

This is just one example. That tokenizing logic that I have in my original post is in a common util method that is used in multiple places for the purpose of tokenizing strings.
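A shared utility like that does not strictly have to go through the Elasticsearch factory classes. The sketch below is a stand-in, not the Lucene standard tokenizer: it uses the JDK's `java.text.BreakIterator` word instance, which also splits on Unicode word boundaries but may differ from `StandardTokenizer` on edge cases. The class name `TermTokenizer` and the method `tokenize` are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical utility: lowercases the input and splits it into word tokens.
// Stand-in for the standard tokenizer, using only the JDK.
final class TermTokenizer {

    private TermTokenizer() {}

    static List<String> tokenize(String terms) {
        String text = terms.toLowerCase(Locale.ROOT);
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String candidate = text.substring(start, end);
            // The iterator also returns whitespace and punctuation segments;
            // keep only segments that contain a letter or digit.
            if (candidate.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(candidate);
            }
        }
        return tokens;
    }
}
```

If exact parity with the old behavior matters and Lucene is on the classpath, `org.apache.lucene.analysis.standard.StandardTokenizer` can also be constructed directly and given its input with `setReader(...)`, which avoids the Elasticsearch factory altogether.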

spinscale · March 16, 2020, 1:50pm

You may want to read about the common terms query and, even more importantly, about why it is deprecated in newer versions. That might help you rethink your whole architecture in that regard.

system · April 13, 2020, 1:50pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
