Phonetic Filter Indexing in Polish


#1

The Phonetic token filter makes it possible to build an analyzer that indexes Polish text using the Beider-Morse encoder with its languageset setting.
For regular Polish indexing I am using the Stempel Polish analyzer.
How can I use both analyzers to index the same text? Since there is no way to customize the Stempel analyzer, is there a way to run two analyzers over the same text at index time?
I have found the "Combo" plugin, but it looks rather old and does not work with Elasticsearch 2.1.


(David Pilato) #2

You can index the same text with multiple analyzers by using multi fields, but at the end of the day you'll have one analyzer per sub-field.
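For example, a multi-fields mapping along these lines (field and analyzer names here are hypothetical, using the 2.x string type) indexes the same text once per sub-field:

```json
"mappings": {
  "doc": {
    "properties": {
      "content": {
        "type": "string",
        "analyzer": "polish",
        "fields": {
          "phonetic": {
            "type": "string",
            "analyzer": "polish_phonetic"
          }
        }
      }
    }
  }
}
```

You would then query `content` for stemmed matches and `content.phonetic` for phonetic matches.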

If you want to chain analyzers, that's not doable out of the box.

But you can define your own custom analyzer that uses several token filters, so you can probably combine the stempel token filter with the phonetic token filter.


#3

Thank you, David.

So suppose I create a Polish phonetic filter like this:

"polish_phonetic_filter" : {
	"languageset" : [
		"english",
		"polish"
	],
	"rule_type" : "approx",
	"type" : "phonetic",
	"encoder" : "beider_morse",
	"name_type" : "generic"
}

and then I create a Polish indexing analyzer like this:

"polish_indexing" : {
	"filter" : [
		"lowercase",
		"stempel",
		"polish_stem",
		"polish_phonetic_filter"
	],
	"char_filter" : [
		"html_strip"
	],
	"type" : "custom",
	"tokenizer" : "stempel"
}

Will this analyzer be fully compatible with the Stempel analyzer, the only difference being the addition of the phonetic filter?
My concern is: do I lose any of the original Stempel plugin functionality (i.e. are the filters I specified above the full filter list that Stempel uses)?


(David Pilato) #4

I looked at the stempel Lucene source code and found that it builds the following:

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    final Tokenizer source = new StandardTokenizer();
    TokenStream result = new StandardFilter(source);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopwords);
    if(!stemExclusionSet.isEmpty())
      result = new SetKeywordMarkerFilter(result, stemExclusionSet);
    result = new StempelFilter(result, new StempelStemmer(stemTable));
    return new TokenStreamComponents(source, result);
  }

So basically the stempel analyzer is a:

  • StandardTokenizer
  • StandardFilter
  • LowerCaseFilter
  • StopFilter (with polish stop words)
  • StempelFilter (with a StempelStemmer)
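As a sketch (the analyzer and filter names below are my own, and the stop word list is a placeholder — the real Polish list ships inside the Lucene stempel module), that pipeline could be approximated in index settings as:

```json
"analysis": {
  "filter": {
    "polish_stop_example": {
      "type": "stop",
      "stopwords": ["i", "w", "na", "się"]
    }
  },
  "analyzer": {
    "stempel_like": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": ["lowercase", "polish_stop_example", "polish_stem"]
    }
  }
}
```

Appending your polish_phonetic_filter at the end of that filter chain would then add the phonetic behavior on top.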

The plugin exposes:

  • The stempel analyzer (which is effectively a custom analyzer made of everything listed above)
  • The stempel token filter

So yes, I believe you can create something "compatible". That said, you should know that the analyzer you choose for a field is used at both index time and search time, so I'm wondering why you are asking about "compatibility" here.

I'd give a custom analyzer a try (and use the _analyze API, which is fantastic for understanding and debugging analysis).
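For instance (a sketch — the index and analyzer names are assumed; on 2.x the _analyze API accepts the analyzer and text as query parameters):

```
curl 'localhost:9200/my_index/_analyze?analyzer=polish_indexing&text=przykład'
```

The response lists the tokens the analyzer emits, which makes it easy to compare a custom analyzer against the stock polish analyzer on the same input.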


#5

Thank you.
By "compatibility" I mean compatibility between what Stempel does and what the "customized" analyzer will do: I just want to make sure no functionality is lost.

