Multi-field custom Elasticsearch analyzer

I am creating an Elasticsearch plugin. I want to analyze a URL in two different ways with the same analyzer type (the pattern analyzer), using a different regular expression for each so that I get different tokens. I have started coding the plugin and have created three files: AnalyzerProvider.java, TokenizerFactory.java, and Plugin.java. Below is the code in each file:

Code in the AnalyzerProvider file:

package pl.allegro.tech.elasticsearch.index.analysis.pl;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.elasticsearch.common.regex.Regex;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractIndexAnalyzerProvider;
import org.elasticsearch.index.analysis.Analysis;
import org.elasticsearch.index.analysis.PatternAnalyzer;

import java.util.regex.Pattern;

public class MorfologikAnalyzerProvider extends AbstractIndexAnalyzerProvider<Analyzer> {

    private final PatternAnalyzer analyzer;

    public MorfologikAnalyzerProvider(IndexSettings indexSettings, Environment env, String name, Settings settings) {
        super(indexSettings, name, settings);

        final CharArraySet defaultStopwords = CharArraySet.EMPTY_SET;
        boolean lowercase = settings.getAsBoolean("lowercase", true);
        CharArraySet stopWords = Analysis.parseStopWords(env, settings, defaultStopwords);

        String sPattern = settings.get("pattern", "\\W+" /*PatternAnalyzer.NON_WORD_PATTERN*/);
        if (sPattern == null) {
            throw new IllegalArgumentException("Analyzer [" + name + "] of type pattern must have a `pattern` set");
        }
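        // "flags" is a pipe-separated list of java.util.regex.Pattern flag names, e.g. "CASE_INSENSITIVE|COMMENTS"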
        Pattern pattern = Regex.compile(sPattern, settings.get("flags"));

        analyzer = new PatternAnalyzer(pattern, lowercase, stopWords);
    }

    @Override
    public PatternAnalyzer get() {
        return analyzer;
    }
}
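
For reference, here is a quick standalone check of what the analyzer produces for a sample URL, using the same org.elasticsearch.index.analysis.PatternAnalyzer class the provider imports; the pattern and the input are just test values I made up:

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.elasticsearch.index.analysis.PatternAnalyzer;

public class PatternAnalyzerCheck {

    public static void main(String[] args) throws IOException {
        Pattern pattern = Pattern.compile("[/:?=&]+"); // made-up URL separator pattern
        try (Analyzer analyzer = new PatternAnalyzer(pattern, true, CharArraySet.EMPTY_SET);
             TokenStream ts = analyzer.tokenStream("url", "https://example.com/products?id=42")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // https, example.com, products, id, 42
            }
            ts.end();
        }
    }
}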

Below is the code in the TokenizerFactory file:

package pl.allegro.tech.elasticsearch.index.analysis.pl;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.pattern.PatternTokenizer;
import org.elasticsearch.common.regex.Regex;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenizerFactory;

import java.util.regex.Pattern;

public class UrlTokenizerFactory extends AbstractTokenizerFactory {

    private final Pattern pattern;
    private final int group;

    public UrlTokenizerFactory(IndexSettings indexSettings, Environment environment, String name, Settings settings) {
        super(indexSettings, name, settings);

        String sPattern = settings.get("pattern", "\\W+" /*PatternAnalyzer.NON_WORD_PATTERN*/);
        if (sPattern == null) {
            throw new IllegalArgumentException("pattern is missing for [" + name + "] tokenizer of type 'pattern'");
        }

        this.pattern = Regex.compile(sPattern, settings.get("flags"));
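        // group = -1 splits on the pattern; group >= 0 keeps that capture group as the token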
        this.group = settings.getAsInt("group", -1);
    }

    @Override
    public Tokenizer create() {
        return new PatternTokenizer(pattern, group);
    }
}
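
For what it's worth, here is how I understand the group setting (again a throwaway snippet with a made-up input URL): with group = -1 the pattern marks the separators, so the tokenizer splits on matches, while with group >= 0 the pattern marks the tokens themselves and group selects the capture group to keep:

import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.pattern.PatternTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class GroupSettingCheck {

    public static void main(String[] args) throws IOException {
        // group = -1: the regex matches the separators, so the input is split on them.
        Tokenizer splitter = new PatternTokenizer(Pattern.compile("[/?=&]+"), -1);
        splitter.setReader(new StringReader("example.com/products?id=42"));
        CharTermAttribute term = splitter.addAttribute(CharTermAttribute.class);
        splitter.reset();
        while (splitter.incrementToken()) {
            System.out.println(term.toString()); // example.com, products, id, 42
        }
        splitter.end();
        splitter.close();
    }
}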

Below is the code in the Plugin.java file:

package pl.allegro.tech.elasticsearch.plugin.analysis.morfologik;

import org.apache.lucene.analysis.Analyzer;
import org.elasticsearch.index.analysis.AnalyzerProvider;
import org.elasticsearch.index.analysis.TokenizerFactory;
import org.elasticsearch.indices.analysis.AnalysisModule.AnalysisProvider;
import org.elasticsearch.plugins.AnalysisPlugin;
import org.elasticsearch.plugins.Plugin;
import pl.allegro.tech.elasticsearch.index.analysis.pl.MorfologikAnalyzerProvider;
import pl.allegro.tech.elasticsearch.index.analysis.pl.UrlTokenizerFactory;

import java.util.HashMap;
import java.util.Map;

import static java.util.Collections.singletonMap;

public class AnalysisMorfologikPlugin extends Plugin implements AnalysisPlugin {

    public static final String ANALYZER_NAME = "url_analyzer";
    public static final String TOKENIZER_NAME = "url_tokenizer";

    @Override
    public Map<String, AnalysisProvider<TokenizerFactory>> getTokenizers() {
        Map<String, AnalysisProvider<TokenizerFactory>> extra = new HashMap<>();
        extra.put(TOKENIZER_NAME, UrlTokenizerFactory::new);
        return extra;
    }


    @Override
    public Map<String, AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> getAnalyzers() {
        return singletonMap(ANALYZER_NAME, MorfologikAnalyzerProvider::new);
    }
}
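
My understanding is that the names registered above act as analysis types, so in the index settings I should be able to define two analyzers of the url_analyzer type, each with its own regular expression; the analyzer names and patterns below are just examples I made up:

import org.elasticsearch.common.settings.Settings;

public class UrlAnalyzerSettings {

    // Hypothetical index settings: two analyzers of the same custom type,
    // each configured with a different pattern.
    public static Settings twoUrlAnalyzers() {
        return Settings.builder()
                .put("index.analysis.analyzer.url_path_analyzer.type", "url_analyzer")
                .put("index.analysis.analyzer.url_path_analyzer.pattern", "[/]+")
                .put("index.analysis.analyzer.url_query_analyzer.type", "url_analyzer")
                .put("index.analysis.analyzer.url_query_analyzer.pattern", "[?=&]+")
                .build();
    }
}

If that works, I could point two sub-fields of the url field at url_path_analyzer and url_query_analyzer in the mapping, which is how I understand multi-fields are meant to be used.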


From my research it seems that, in order to use multi-fields, I have to use the PerFieldAnalyzerWrapper in Lucene. My question is: where do I have to write the code for the PerFieldAnalyzerWrapper, and how can I do it? I have also seen some example code where a document was created and indexed along with the use of the PerFieldAnalyzerWrapper. Do I also have to write the document-indexing code? If yes, then in which file?
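
The Lucene examples I have seen look roughly like this; note that the field names, the patterns, and the wrapped analyzers below are all placeholders I made up, not code from my plugin:

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.elasticsearch.index.analysis.PatternAnalyzer;

public class PerFieldExample {

    public static IndexWriterConfig makeConfig() {
        // One analyzer per field: the path sub-field splits on slashes,
        // the query sub-field splits on query-string separators.
        Map<String, Analyzer> perField = new HashMap<>();
        perField.put("url.path",
                new PatternAnalyzer(Pattern.compile("[/]+"), true, CharArraySet.EMPTY_SET));
        perField.put("url.query",
                new PatternAnalyzer(Pattern.compile("[?=&]+"), true, CharArraySet.EMPTY_SET));

        // Fields without an explicit entry fall back to the default analyzer.
        Analyzer wrapper = new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);

        // In plain Lucene, the wrapper is handed to the IndexWriter at indexing time.
        return new IndexWriterConfig(wrapper);
    }
}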

Has anyone worked on a similar project? Any help will be highly appreciated.

I've worked on a similar project where we used a custom analyzer in our plugin. We did not use an analyzer factory, so I cannot add much on that part, but we used the high-level REST client, which comes with analyze, term vector, and other APIs. I don't know if this is helpful, but maybe you can give it a try:

https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/java-rest-high-analyze.html
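
For example, something along these lines; this is from memory, assumes a 7.x high-level REST client, and the index and analyzer names are placeholders:

import java.io.IOException;

import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.indices.AnalyzeRequest;
import org.elasticsearch.client.indices.AnalyzeResponse;

public class AnalyzeApiExample {

    // Runs the _analyze API against a custom analyzer defined on "my_index".
    public static void printTokens(RestHighLevelClient client) throws IOException {
        AnalyzeRequest request = AnalyzeRequest.withIndexAnalyzer(
                "my_index",                    // index where the analyzer is defined
                "url_analyzer",                // analyzer name (placeholder)
                "example.com/products?id=42"); // sample text to analyze
        AnalyzeResponse response = client.indices().analyze(request, RequestOptions.DEFAULT);
        for (AnalyzeResponse.AnalyzeToken token : response.getTokens()) {
            System.out.println(token.getTerm());
        }
    }
}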
