Phonetic Filter Indexing in Polish


#1

The Phonetic token filter makes it possible to build an analyzer that indexes Polish text using the Beider-Morse encoder with its languageset setting.
For regular Polish indexing I am using the Stempel Polish analyzer.
How can I use both analyzers to index the same text? Since there is no way to customize the Stempel analyzer, is there a way to run two analyzers over the same text at index time?
I have found the "Combo" plugin, but it looks rather old and does not work with Elasticsearch 2.1.


(David Pilato) #2

You can index the same text with multiple analyzers by using multi fields, but at the end of the day you'll have one analyzer per sub-field.
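For example, a multi-fields mapping along these lines (field and analyzer names here are hypothetical, using the 2.x string type) indexes the same text once per sub-field:

```json
"mappings": {
  "doc": {
    "properties": {
      "content": {
        "type": "string",
        "analyzer": "polish",
        "fields": {
          "phonetic": {
            "type": "string",
            "analyzer": "polish_phonetic"
          }
        }
      }
    }
  }
}
```

You would then query `content` for stemmed matches and `content.phonetic` for phonetic matches.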

If you want to chain analyzers, that's not doable out of the box.

But you can define your own custom analyzer that uses several token filters, so you can probably combine the stempel token filter with the phonetic token filter.


#3

Thank you, David.

So suppose I create a Polish phonetic filter like this:

"polish_phonetic_filter" : {
	"languageset" : [
		"english",
		"polish"
	],
	"rule_type" : "approx",
	"type" : "phonetic",
	"encoder" : "beider_morse",
	"name_type" : "generic"
}

and then I create a Polish indexing analyzer like this:

"polish_indexing" : {
	"filter" : [
		"lowercase",
		"stempel",
		"polish_stem",
		"polish_phonetic_filter"
	],
	"char_filter" : [
		"html_strip"
	],
	"type" : "custom",
	"tokenizer" : "stempel"
}

Will this analyzer be fully compatible with the Stempel analyzer, the only difference being the addition of the phonetic filter?
My concern is: do I lose any of the original Stempel plugin functionality (i.e. are the filters I specified above the full filter list that Stempel uses)?


(David Pilato) #4

I looked at the stempel Lucene source code and found that it builds the following:

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    final Tokenizer source = new StandardTokenizer();
    TokenStream result = new StandardFilter(source);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopwords);
    if(!stemExclusionSet.isEmpty())
      result = new SetKeywordMarkerFilter(result, stemExclusionSet);
    result = new StempelFilter(result, new StempelStemmer(stemTable));
    return new TokenStreamComponents(source, result);
  }

So basically the stempel analyzer is a:

  • StandardTokenizer
  • StandardFilter
  • LowerCaseFilter
  • StopFilter (with polish stop words)
  • StempelFilter (with a StempelStemmer)
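As a sketch (the analyzer and filter names below are my own, and the stop word list is a placeholder — the real Polish list ships inside the Lucene stempel module), that pipeline could be approximated in index settings as:

```json
"analysis": {
  "filter": {
    "polish_stop_example": {
      "type": "stop",
      "stopwords": ["i", "w", "na", "się"]
    }
  },
  "analyzer": {
    "stempel_like": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": ["lowercase", "polish_stop_example", "polish_stem"]
    }
  }
}
```

Appending your polish_phonetic_filter at the end of that filter chain would then add the phonetic behavior on top.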

The plugin exposes:

  • The stempel analyzer (which is effectively a custom analyzer made of everything listed above)
  • The stempel token filter

So yes, I believe you can create something "compatible". That said, you should know that the analyzer you choose for a field is used at both index time and search time, so I'm wondering why you are asking about "compatibility" here.

I'd give a custom analyzer a try (and use the _analyze API, which is fantastic for understanding and debugging analysis).
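For instance (a sketch — the index and analyzer names are assumed; on 2.x the _analyze API accepts the analyzer and text as query parameters):

```
curl 'localhost:9200/my_index/_analyze?analyzer=polish_indexing&text=przykład'
```

The response lists the tokens the analyzer emits, which makes it easy to compare a custom analyzer against the stock polish analyzer on the same input.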


#5

Thank you.
By "compatibility" I mean compatibility between what Stempel does and what the "customized" analyzer will do: I just want to make sure no functionality is lost.

