Multi-field custom Elasticsearch analyzer

I am creating an Elasticsearch plugin. I want to analyze a URL in two different ways with the same analyzer type (the pattern analyzer), using a different regular expression for each so that I get different tokens. I have started coding the plugin and have created three files: AnalyzerProvider.java, TokenizerFactory.java, and Plugin.java. Below is the code in each file:

Code in the AnalyzerProvider file:

package pl.allegro.tech.elasticsearch.index.analysis.pl;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.elasticsearch.common.regex.Regex;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractIndexAnalyzerProvider;
import org.elasticsearch.index.analysis.Analysis;
import org.elasticsearch.index.analysis.PatternAnalyzer;

import java.util.regex.Pattern;

public class MorfologikAnalyzerProvider extends AbstractIndexAnalyzerProvider<Analyzer> {

    private final PatternAnalyzer analyzer;

    public MorfologikAnalyzerProvider(IndexSettings indexSettings, Environment env, String name, Settings settings) {
        super(indexSettings, name, settings);

        final CharArraySet defaultStopwords = CharArraySet.EMPTY_SET;
        boolean lowercase = settings.getAsBoolean("lowercase", true);
        CharArraySet stopWords = Analysis.parseStopWords(env, settings, defaultStopwords);

        String sPattern = settings.get("pattern", "\\W+" /*PatternAnalyzer.NON_WORD_PATTERN*/);
        if (sPattern == null) {
            throw new IllegalArgumentException("Analyzer [" + name + "] of type pattern must have a `pattern` set");
        }
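        // "flags" is a pipe-separated list of java.util.regex.Pattern flag names, e.g. "CASE_INSENSITIVE|COMMENTS"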
        Pattern pattern = Regex.compile(sPattern, settings.get("flags"));

        analyzer = new PatternAnalyzer(pattern, lowercase, stopWords);
    }

    @Override
    public PatternAnalyzer get() {
        return analyzer;
    }
}
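
For reference, here is a quick standalone check of what the analyzer produces for a sample URL, using the same org.elasticsearch.index.analysis.PatternAnalyzer class the provider imports; the pattern and the input are just test values I made up:

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.elasticsearch.index.analysis.PatternAnalyzer;

public class PatternAnalyzerCheck {

    public static void main(String[] args) throws IOException {
        Pattern pattern = Pattern.compile("[/:?=&]+"); // made-up URL separator pattern
        try (Analyzer analyzer = new PatternAnalyzer(pattern, true, CharArraySet.EMPTY_SET);
             TokenStream ts = analyzer.tokenStream("url", "https://example.com/products?id=42")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // https, example.com, products, id, 42
            }
            ts.end();
        }
    }
}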

Below is the code in the TokenizerFactory file:

package pl.allegro.tech.elasticsearch.index.analysis.pl;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.pattern.PatternTokenizer;
import org.elasticsearch.common.regex.Regex;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenizerFactory;

import java.util.regex.Pattern;

public class UrlTokenizerFactory extends AbstractTokenizerFactory {

    private final Pattern pattern;
    private final int group;

    public UrlTokenizerFactory(IndexSettings indexSettings, Environment environment, String name, Settings settings) {
        super(indexSettings, name, settings);

        String sPattern = settings.get("pattern", "\\W+" /*PatternAnalyzer.NON_WORD_PATTERN*/);
        if (sPattern == null) {
            throw new IllegalArgumentException("pattern is missing for [" + name + "] tokenizer of type 'pattern'");
        }

        this.pattern = Regex.compile(sPattern, settings.get("flags"));
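        // group = -1 splits on the pattern; group >= 0 keeps that capture group as the token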
        this.group = settings.getAsInt("group", -1);
    }

    @Override
    public Tokenizer create() {
        return new PatternTokenizer(pattern, group);
    }
}
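
For what it's worth, here is how I understand the group setting (again a throwaway snippet with a made-up input URL): with group = -1 the pattern marks the separators, so the tokenizer splits on matches, while with group >= 0 the pattern marks the tokens themselves and group selects the capture group to keep:

import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.pattern.PatternTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class GroupSettingCheck {

    public static void main(String[] args) throws IOException {
        // group = -1: the regex matches the separators, so the input is split on them.
        Tokenizer splitter = new PatternTokenizer(Pattern.compile("[/?=&]+"), -1);
        splitter.setReader(new StringReader("example.com/products?id=42"));
        CharTermAttribute term = splitter.addAttribute(CharTermAttribute.class);
        splitter.reset();
        while (splitter.incrementToken()) {
            System.out.println(term.toString()); // example.com, products, id, 42
        }
        splitter.end();
        splitter.close();
    }
}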

Below is the code in the Plugin.java file:

package pl.allegro.tech.elasticsearch.plugin.analysis.morfologik;

import org.apache.lucene.analysis.Analyzer;
import org.elasticsearch.index.analysis.AnalyzerProvider;
import org.elasticsearch.index.analysis.TokenizerFactory;
import org.elasticsearch.indices.analysis.AnalysisModule.AnalysisProvider;
import org.elasticsearch.plugins.AnalysisPlugin;
import org.elasticsearch.plugins.Plugin;
import pl.allegro.tech.elasticsearch.index.analysis.pl.MorfologikAnalyzerProvider;
import pl.allegro.tech.elasticsearch.index.analysis.pl.UrlTokenizerFactory;

import java.util.HashMap;
import java.util.Map;

import static java.util.Collections.singletonMap;

public class AnalysisMorfologikPlugin extends Plugin implements AnalysisPlugin {

    public static final String ANALYZER_NAME = "url_analyzer";
    public static final String TOKENIZER_NAME = "url_tokenizer";

    @Override
    public Map<String, AnalysisProvider<TokenizerFactory>> getTokenizers() {
        Map<String, AnalysisProvider<TokenizerFactory>> extra = new HashMap<>();
        extra.put(TOKENIZER_NAME, UrlTokenizerFactory::new);
        return extra;
    }


    @Override
    public Map<String, AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> getAnalyzers() {
        return singletonMap(ANALYZER_NAME, MorfologikAnalyzerProvider::new);
    }
}
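
My understanding is that the names registered above act as analysis types, so in the index settings I should be able to define two analyzers of the url_analyzer type, each with its own regular expression; the analyzer names and patterns below are just examples I made up:

import org.elasticsearch.common.settings.Settings;

public class UrlAnalyzerSettings {

    // Hypothetical index settings: two analyzers of the same custom type,
    // each configured with a different pattern.
    public static Settings twoUrlAnalyzers() {
        return Settings.builder()
                .put("index.analysis.analyzer.url_path_analyzer.type", "url_analyzer")
                .put("index.analysis.analyzer.url_path_analyzer.pattern", "[/]+")
                .put("index.analysis.analyzer.url_query_analyzer.type", "url_analyzer")
                .put("index.analysis.analyzer.url_query_analyzer.pattern", "[?=&]+")
                .build();
    }
}

If that works, I could point two sub-fields of the url field at url_path_analyzer and url_query_analyzer in the mapping, which is how I understand multi-fields are meant to be used.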


From my research it seems that, in order to use multi-fields, I have to use the PerFieldAnalyzerWrapper in Lucene. My question is: where do I have to write the code for the PerFieldAnalyzerWrapper, and how can I do it? I have also seen some example code where a document was created and indexed along with the use of the PerFieldAnalyzerWrapper. Do I also have to write the document-indexing code? If yes, then in which file?
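
The Lucene examples I have seen look roughly like this; note that the field names, the patterns, and the wrapped analyzers below are all placeholders I made up, not code from my plugin:

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.elasticsearch.index.analysis.PatternAnalyzer;

public class PerFieldExample {

    public static IndexWriterConfig makeConfig() {
        // One analyzer per field: the path sub-field splits on slashes,
        // the query sub-field splits on query-string separators.
        Map<String, Analyzer> perField = new HashMap<>();
        perField.put("url.path",
                new PatternAnalyzer(Pattern.compile("[/]+"), true, CharArraySet.EMPTY_SET));
        perField.put("url.query",
                new PatternAnalyzer(Pattern.compile("[?=&]+"), true, CharArraySet.EMPTY_SET));

        // Fields without an explicit entry fall back to the default analyzer.
        Analyzer wrapper = new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);

        // In plain Lucene, the wrapper is handed to the IndexWriter at indexing time.
        return new IndexWriterConfig(wrapper);
    }
}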

Has anyone worked on a similar project? Any help will be highly appreciated.

I've worked on a similar project where we used a custom analyzer in our plugin. We did not use an analyzer factory, so I cannot add much on that part, but we used the high-level REST client, which comes with analyze, term vector, and other APIs. I don't know if this is helpful, but maybe you can give it a try:

https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/java-rest-high-analyze.html
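
For example, something along these lines; this is from memory, assumes a 7.x high-level REST client, and the index and analyzer names are placeholders:

import java.io.IOException;

import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.AnalyzeRequest;
import org.elasticsearch.client.indices.AnalyzeResponse;

public class AnalyzeApiExample {

    // Runs the _analyze API against a custom analyzer defined on "my_index".
    public static void printTokens(RestHighLevelClient client) throws IOException {
        AnalyzeRequest request = AnalyzeRequest.withIndexAnalyzer(
                "my_index",                    // index where the analyzer is defined
                "url_analyzer",                // analyzer name (placeholder)
                "example.com/products?id=42"); // sample text to analyze
        AnalyzeResponse response = client.indices().analyze(request, RequestOptions.DEFAULT);
        for (AnalyzeResponse.AnalyzeToken token : response.getTokens()) {
            System.out.println(token.getTerm());
        }
    }
}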
