I am creating an Elasticsearch plugin. I want to analyze a URL in two different ways using the same analyzer type (the pattern analyzer), but with different regular expressions, so that each produces different tokens. I have started coding the plugin and have created three files: AnalyzerProvider.java, TokenizerFactory.java, and Plugin.java. Below is the code in each file:
Code in the AnalyzerProvider.java file:
package pl.allegro.tech.elasticsearch.index.analysis.pl;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.elasticsearch.common.regex.Regex;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractIndexAnalyzerProvider;
import org.elasticsearch.index.analysis.Analysis;
import org.elasticsearch.index.analysis.PatternAnalyzer;
import java.util.regex.Pattern;
public class MorfologikAnalyzerProvider extends AbstractIndexAnalyzerProvider<Analyzer> {

    private final PatternAnalyzer analyzer;

    public MorfologikAnalyzerProvider(IndexSettings indexSettings, Environment env, String name, Settings settings) {
        super(indexSettings, name, settings);
        final CharArraySet defaultStopwords = CharArraySet.EMPTY_SET;
        boolean lowercase = settings.getAsBoolean("lowercase", true);
        CharArraySet stopWords = Analysis.parseStopWords(env, settings, defaultStopwords);
        String sPattern = settings.get("pattern", "\\W+" /*PatternAnalyzer.NON_WORD_PATTERN*/);
        if (sPattern == null) {
            throw new IllegalArgumentException("Analyzer [" + name + "] of type pattern must have a `pattern` set");
        }
        Pattern pattern = Regex.compile(sPattern, settings.get("flags"));
        analyzer = new PatternAnalyzer(pattern, lowercase, stopWords);
    }

    @Override
    public PatternAnalyzer get() {
        return analyzer;
    }
}
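For context, this provider reads everything it needs from the analyzer settings, so my hope is that a second analyzer with a different regex could be driven purely by configuration. A minimal sketch of the settings I would expect it to receive; the regex here is a hypothetical URL-delimiter pattern I made up, not something already in my plugin:

import org.elasticsearch.common.settings.Settings;

// Hypothetical settings block for one of the two URL analyzers; a
// second analyzer would get the same keys with a different "pattern".
Settings analyzerSettings = Settings.builder()
        .put("pattern", "[/?&=]+") // split on URL path/query delimiters
        .put("lowercase", true)
        .build();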
Below is the code in the TokenizerFactory.java file:
package pl.allegro.tech.elasticsearch.index.analysis.pl;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.pattern.PatternTokenizer;
import org.elasticsearch.common.regex.Regex;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.index.IndexSettings;
import org.elasticsearch.index.analysis.AbstractTokenizerFactory;
import java.util.regex.Pattern;
public class UrlTokenizerFactory extends AbstractTokenizerFactory {

    private final Pattern pattern;
    private final int group;

    public UrlTokenizerFactory(IndexSettings indexSettings, Environment environment, String name, Settings settings) {
        super(indexSettings, name, settings);
        String sPattern = settings.get("pattern", "\\W+" /*PatternAnalyzer.NON_WORD_PATTERN*/);
        if (sPattern == null) {
            throw new IllegalArgumentException("pattern is missing for [" + name + "] tokenizer of type 'pattern'");
        }
        this.pattern = Regex.compile(sPattern, settings.get("flags"));
        this.group = settings.getAsInt("group", -1);
    }

    @Override
    public Tokenizer create() {
        return new PatternTokenizer(pattern, group);
    }
}
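To check my understanding of how PatternTokenizer behaves with group = -1 (the regex acts as a splitter rather than a matcher), I tried it directly against a URL. This is a standalone Lucene sketch with a hypothetical regex, not plugin code:

import java.io.StringReader;
import java.util.regex.Pattern;
import org.apache.lucene.analysis.pattern.PatternTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class UrlTokenizerDemo {
    public static void main(String[] args) throws Exception {
        // With group = -1 the regex splits the input instead of matching
        // tokens, so this pattern breaks a URL at its delimiters.
        PatternTokenizer tokenizer = new PatternTokenizer(Pattern.compile("[/:.?=&]+"), -1);
        tokenizer.setReader(new StringReader("https://example.com/products/list?page=2"));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            System.out.println(term.toString()); // https, example, com, products, list, page, 2
        }
        tokenizer.end();
        tokenizer.close();
    }
}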
Below is the code in the Plugin.java file:
package pl.allegro.tech.elasticsearch.plugin.analysis.morfologik;
import org.apache.lucene.analysis.Analyzer;
import org.elasticsearch.index.analysis.AnalyzerProvider;
import org.elasticsearch.index.analysis.TokenizerFactory;
import org.elasticsearch.indices.analysis.AnalysisModule.AnalysisProvider;
import org.elasticsearch.plugins.AnalysisPlugin;
import org.elasticsearch.plugins.Plugin;
import pl.allegro.tech.elasticsearch.index.analysis.pl.MorfologikAnalyzerProvider;
import pl.allegro.tech.elasticsearch.index.analysis.pl.UrlTokenizerFactory;
import java.util.HashMap;
import java.util.Map;
import static java.util.Collections.singletonMap;
public class AnalysisMorfologikPlugin extends Plugin implements AnalysisPlugin {

    public static final String ANALYZER_NAME = "url_analyzer";
    public static final String TOKENIZER_NAME = "url_tokenizer";

    @Override
    public Map<String, AnalysisProvider<TokenizerFactory>> getTokenizers() {
        Map<String, AnalysisProvider<TokenizerFactory>> extra = new HashMap<>();
        extra.put(TOKENIZER_NAME, UrlTokenizerFactory::new);
        return extra;
    }

    @Override
    public Map<String, AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> getAnalyzers() {
        return singletonMap(ANALYZER_NAME, MorfologikAnalyzerProvider::new);
    }
}
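Since I need two different tokenizations, my current idea is to expose two analyzer names from the plugin and let each one receive a different regex through the "pattern" setting. A sketch of how getAnalyzers() could look under that assumption; both names are hypothetical and map to the same provider class:

// Hypothetical variant of getAnalyzers(): both names resolve to the
// same provider class, and each analyzer instance would get its own
// regex via the "pattern" setting in the index configuration.
@Override
public Map<String, AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> getAnalyzers() {
    Map<String, AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> extra = new HashMap<>();
    extra.put("url_domain_analyzer", MorfologikAnalyzerProvider::new);
    extra.put("url_path_analyzer", MorfologikAnalyzerProvider::new);
    return extra;
}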
From my research, it seems that in order to analyze the same field in multiple ways (multi-fields), I have to use Lucene's PerFieldAnalyzerWrapper. My question is: where do I have to write the code for PerFieldAnalyzerWrapper, and how do I do it? I have also seen some example code where a document was created and indexed along with the PerFieldAnalyzerWrapper usage. Do I also have to write the document-indexing code? If yes, then in which file?
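For reference, the example code I have seen looked roughly like the following. It is plain Lucene, not plugin code, and the field names and analyzers are placeholders I made up; in my case the two placeholder analyzers would be the pattern analyzers with different regexes:

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class PerFieldDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder analyzers - these stand in for the two pattern
        // analyzers with different regexes.
        Analyzer domainAnalyzer = new StandardAnalyzer();
        Analyzer pathAnalyzer = new StandardAnalyzer();

        // Each field name is mapped to its own analyzer; fields not in
        // the map fall back to the default analyzer (first argument).
        Map<String, Analyzer> perField = new HashMap<>();
        perField.put("url_domain", domainAnalyzer);
        perField.put("url_path", pathAnalyzer);
        Analyzer wrapper = new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);

        // The indexing part of the examples: the wrapper is handed to the
        // IndexWriter, and the same URL is added under both field names so
        // each field tokenizes it with its own analyzer.
        Directory dir = new RAMDirectory(); // in-memory directory, just for the sketch
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(wrapper));
        Document doc = new Document();
        doc.add(new TextField("url_domain", "https://example.com/a/b", Field.Store.YES));
        doc.add(new TextField("url_path", "https://example.com/a/b", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();
    }
}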