Using the Snowball stemmers


(Joaquin Cuenca Abela) #1

Hi,

I want to index some Spanish text, and it seems that I need to use
an external stemmer to do this, as ES/Lucene doesn't include a
Spanish stemmer by default.

Are there any docs on how to use an external stemmer (for instance the
Snowball ones) with ElasticSearch?

Thanks!

--
Joaquin Cuenca Abela


(Sebastian Gavarini) #2

Hi Joaquin,

If I remember correctly, Lucene includes the Snowball family of
stemmers in its contrib package. PorterStemmer is included in ES, but
it's English only. I'll try to draft an example for you. I would
suggest that you look at how PorterStemmer is included in ES and
create similar classes that use Lucene's Snowball. Take a look at the
class:

  • PorterStemTokenFilterFactory: the factory responsible for
    creating and configuring the stemmer

Create a similar factory, e.g. SnowballTokenFilterFactory:

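package org.elasticsearch.index.analysis;

// Lucene classes (the Snowball filter lives in Lucene's contrib analyzers)
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballFilter;
// ... plus the ES imports for Index, Settings, @Inject, @Assisted and
// @IndexSettings; their packages depend on the ES version.

public class SnowballTokenFilterFactory extends AbstractTokenFilterFactory {

    // Snowball stemmer name, e.g. "Spanish", read from the filter settings
    private final String language;

    @Inject public SnowballTokenFilterFactory(Index index,
            @IndexSettings Settings indexSettings,
            @Assisted String name, @Assisted Settings settings) {
        super(index, indexSettings, name);
        this.language = settings.get("language");
    }

    // Wraps the incoming token stream with the Snowball stemmer
    @Override public TokenStream create(TokenStream tokenStream) {
        return new SnowballFilter(tokenStream, language);
    }
}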

I haven't tried it, but it should work pretty much like that, with the
correct imports. It's important (or at least it was when I looked into
it some time ago) to use the package "org.elasticsearch.*" for these
factories. Then the "language" setting read by the factory should be
set in elasticsearch.yml, like:

index:
  analysis:
    analyzer:
      my_analyzer:
        type: custom
        tokenizer: whitespace
        filter: [lowercase, asciifolding, snowball]
    filter:
      snowball:
        type: org.elasticsearch.index.analysis.SnowballTokenFilterFactory
        language: Spanish
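
To get a feel for what the stemmer receives with the analyzer above, here is a rough sketch in plain Java (no Lucene involved) that approximates the first three steps of the chain: whitespace tokenizing, lowercasing, and ASCII folding. The class name AnalyzerChainSketch and its methods are made up for illustration, and the accent stripping via java.text.Normalizer is only an approximation of what the real asciifolding filter does. The Snowball step itself is left out.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the tokenizer and the first two filters of the
// analyzer; this is not Lucene or ES code.
public class AnalyzerChainSketch {

    // Approximates "asciifolding": decompose accented letters (NFD) and
    // drop the combining marks, so that U+00F3 becomes "o", U+00F1 becomes "n".
    public static String foldAccents(String token) {
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    // Approximates "whitespace" tokenizer + "lowercase" + "asciifolding".
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // happens only for blank input
            }
            tokens.add(foldAccents(raw.toLowerCase(Locale.ROOT)));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Input is "Canción ESPAÑOLA  rápida", written with escapes so the
        // source file stays ASCII-only.
        System.out.println(analyze("Canci\u00f3n ESPA\u00d1OLA  r\u00e1pida"));
    }
}
```

The point of the sketch is that the Snowball filter comes last in the list, so with this configuration it sees lowercased tokens with the accents already removed.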

You need to include your custom classes, in this case
SnowballTokenFilterFactory, in a jar in the lib directory of ES.
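
A related classpath detail: as far as I recall, Lucene 3.x's SnowballFilter(TokenStream, String) turns the language name into a class name of the form org.tartarus.snowball.ext.<Language>Stemmer and loads it by reflection. If that holds for your Lucene version, the value is case-sensitive ("Spanish", not "spanish") and the Snowball classes have to be loadable as well. A minimal sketch of that lookup follows; SnowballNameSketch and its helper methods are hypothetical and not part of Lucene or ES.

```java
import java.util.Locale;

// Hypothetical helper illustrating the assumed name-to-class lookup.
public class SnowballNameSketch {

    // Assumption: mirrors the class name that Lucene 3.x's SnowballFilter
    // builds from the language string.
    public static String stemmerClassName(String language) {
        return "org.tartarus.snowball.ext." + language + "Stemmer";
    }

    // Normalizes "spanish" or "SPANISH" to "Spanish".
    public static String normalize(String language) {
        String s = language.trim();
        if (s.isEmpty()) {
            throw new IllegalArgumentException("language must not be empty");
        }
        return s.substring(0, 1).toUpperCase(Locale.ROOT)
                + s.substring(1).toLowerCase(Locale.ROOT);
    }

    // True only if the stemmer class can be loaded from the classpath.
    public static boolean isAvailable(String language) {
        try {
            Class.forName(stemmerClassName(normalize(language)));
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = stemmerClassName(normalize("spanish"));
        System.out.println(name + " available: " + isAvailable("spanish"));
    }
}
```

Running a check like this inside the same classpath as ES is a quick way to tell a misspelled language name apart from a missing jar.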

Please give it a try and let me know if you run into any problems.

Regards,
Sebastian.

On Dec 21, 11:03 am, Joaquin Cuenca Abela joaq...@cuencaabela.com
wrote:



(system) #3