However, I think there may be a bug here: "porter2" appears to map to the
original porter stemmer. I stemmed the same word with both the porter and
porter2 stemmers, and both turned "stayed" into "stai". That is the correct
result for the porter stemmer, but not for the porter2 stemmer. I checked
this against the python stemmer library, where porter stems "stayed" to
"stai" and porter2 stems "stayed" to "stay".
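For what it's worth, the difference seems to come from how the two
algorithms treat a trailing "y". The sketch below is not Lucene or Snowball
code. It is a minimal, self-contained illustration of step 1c only, written
from my understanding of the published Porter and Porter2 definitions, and
applied to "stay" (what is left once the "ed" suffix has been removed). The
class and method names are made up for illustration.

```java
// Illustrative sketch only: step 1c of each algorithm, not a full stemmer.
class Step1cSketch {

    private static boolean isPlainVowel(char c) {
        return "aeiou".indexOf(c) >= 0;
    }

    // Original Porter: 'y' counts as a vowel when preceded by a consonant.
    private static boolean porterIsVowel(String s, int i) {
        char c = s.charAt(i);
        if (isPlainVowel(c)) {
            return true;
        }
        return c == 'y' && i > 0 && !porterIsVowel(s, i - 1);
    }

    // Porter step 1c: a trailing 'y' becomes 'i' if the stem contains a vowel.
    static String porterStep1c(String word) {
        if (!word.endsWith("y")) {
            return word;
        }
        String stem = word.substring(0, word.length() - 1);
        for (int i = 0; i < stem.length(); i++) {
            if (porterIsVowel(stem, i)) {
                return stem + "i";
            }
        }
        return word;
    }

    // Porter2 step 1c: a trailing 'y' becomes 'i' only if it is preceded by
    // a non-vowel that is not the first letter of the word.
    static String porter2Step1c(String word) {
        int n = word.length();
        if (n > 2 && word.endsWith("y")) {
            char prev = word.charAt(n - 2);
            boolean prevIsVowel = isPlainVowel(prev) || prev == 'y';
            if (!prevIsVowel) {
                return word.substring(0, n - 1) + "i";
            }
        }
        return word;
    }

    public static void main(String[] args) {
        String word = "stay"; // "stayed" with the "ed" suffix removed
        System.out.println("porter:  " + porterStep1c(word));
        System.out.println("porter2: " + porter2Step1c(word));
    }
}
```

In "stay" the "y" follows the vowel "a", so the Porter2 rule leaves it
alone, while the original Porter rule rewrites it because the stem "sta"
contains a vowel.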
So I took a look at the code and found the following in
StemmerTokenFilterFactory.java:

    ...
    } else if ("porter".equalsIgnoreCase(language)) {
        return new PorterStemFilter(tokenStream);
    } else if ("porter2".equalsIgnoreCase(language)) {
        return new SnowballFilter(tokenStream, new PorterStemmer());
    ...

Notice that both branches instantiate a Porter stemmer: the "porter2"
branch passes a PorterStemmer to SnowballFilter rather than a porter2
stemmer. Is there a reason why this is not a bug?