A less aggressive stemming token filter that strips only plural

Sorostaran · March 4, 2011, 6:05pm

Does anybody have a token filter (a cut-down Porter stemmer or something) that only does plural stemming in English for ElasticSearch? That seems like a common need for which few databases have out-of-the-box solutions. Any chance of adding ispell as a token filter?

rmuir · March 4, 2011, 6:47pm

Hi, in the upcoming Lucene 3.1 there will be a variety of plural-only
and lighter stemmer implementations for at least the common European languages.
You can see those here:
http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/lucene/contrib/analyzers/common/src/java/org/apache/lucene/analysis/

Additionally, there is the capability to override all stemmers
(including these plural-only and lighter ones), e.g. by specifying
exceptions that should be mapped to some special form:
http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/lucene/contrib/analyzers/common/src/java/org/apache/lucene/analysis/miscellaneous/StemmerOverrideFilter.java
or words that the stemmer should ignore entirely:
http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/lucene/src/java/org/apache/lucene/analysis/KeywordMarkerFilter.java

In my opinion this is the ideal way to go for many apps: start with
something very minimal like plural-only and add exceptions for stuff
that makes sense for your domain (e.g. "fatigues" is not the plural of
"fatigue" in English).

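To make the approach described above concrete, here is a minimal, self-contained sketch in plain Java: a plural-only stemmer, a set of protected words that are never stemmed, and an override map for exceptions. It deliberately does not use the Lucene classes. The class name `PluralOnlyStemmer`, its methods, and the suffix rules (loosely modeled on a simple "S-stemmer") are assumptions for illustration, not the actual code behind the Lucene 3.1 filters.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only: not Lucene's implementation.
final class PluralOnlyStemmer {
    // Words mapped to a fixed stem (the role StemmerOverrideFilter plays in the thread).
    private final Map<String, String> overrides = new HashMap<>();
    // Words that must never be stemmed (the role KeywordMarkerFilter plays in the thread).
    private final Set<String> protectedWords = new HashSet<>();

    PluralOnlyStemmer override(String word, String stem) {
        overrides.put(word, stem);
        return this;
    }

    PluralOnlyStemmer protect(String word) {
        protectedWords.add(word);
        return this;
    }

    String stem(String word) {
        // Exceptions are consulted first, so they win over the generic rules.
        if (protectedWords.contains(word)) {
            return word;
        }
        String mapped = overrides.get(word);
        if (mapped != null) {
            return mapped;
        }
        return stripPlural(word);
    }

    // Assumed rule set: only plural suffixes are touched, nothing else.
    static String stripPlural(String w) {
        int n = w.length();
        if (n < 3 || !w.endsWith("s")) {
            return w;
        }
        if (w.endsWith("ies") && !w.endsWith("eies") && !w.endsWith("aies")) {
            return w.substring(0, n - 3) + "y";   // ponies -> pony
        }
        if (w.endsWith("es") && !w.endsWith("aes") && !w.endsWith("ees") && !w.endsWith("oes")) {
            return w.substring(0, n - 1);         // fatigues -> fatigue
        }
        if (!w.endsWith("us") && !w.endsWith("ss")) {
            return w.substring(0, n - 1);         // cats -> cat
        }
        return w;                                 // status, glass stay as they are
    }

    public static void main(String[] args) {
        PluralOnlyStemmer stemmer = new PluralOnlyStemmer()
                .protect("fatigues")
                .override("mice", "mouse");
        for (String w : new String[] {"cats", "ponies", "fatigues", "mice", "glass"}) {
            System.out.println(w + " -> " + stemmer.stem(w));
        }
    }
}
```

The protected set and the override map are checked before the suffix rules, so domain-specific exceptions always take precedence over the generic algorithm.
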
kimchy · March 5, 2011, 6:58am

Heya,

Thanks Robert! Many of the language-specific analyzers are already exposed in elasticsearch based on Lucene 3.0.3. Once 3.1 is released, all of these will be exposed as well.

-shay.banon

rmuir · March 5, 2011, 5:58pm

Hey, one question. Do elasticsearch users "typically" use the Lucene
Analyzer classes, or do they construct them "on-the-fly" from
token streams? (I think I've seen "custom" used for this?)

The reason I ask is that most of the actual Analyzer classes just
use the "heavy-duty" Snowball stuff, even if more reasonable
alternatives are available.

Just wondering if in 3.2 it would be worth our effort to consider
improving these Analyzers, e.g. defaulting them to less aggressive
stemmers, if they are being used for more than just examples.

kimchy · March 6, 2011, 4:12am

It's got both: it exposes the pre-built analyzers that come out of the box, and it offers the ability to create custom ones (which consist of a tokenizer and one or more filters). The "default" ones are probably more popular, since they are much simpler to configure.

The default analyzers are certainly used for more than just examples, simply because they are much simpler to use and not many users (initially) go to the depth of understanding and configuring their own analyzers. So I would say better out-of-the-box analyzers would go a long way. That will certainly be the case for pure Lucene, and possibly it can be done in elasticsearch by exposing more pre-built analyzers.
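
For readers unfamiliar with the "tokenizer plus one or more filters" structure mentioned above, the following is a rough sketch of that composition in plain Java. It is not the elasticsearch or Lucene API: `SimpleAnalyzer` and the helper methods `whitespace`, `lowercase`, and `stripTrailingS` are hypothetical names invented for this illustration, and the plural rule is intentionally naive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.function.UnaryOperator;

// Hypothetical illustration of an analysis chain: one tokenizer, then filters in order.
final class SimpleAnalyzer {
    private final Function<String, List<String>> tokenizer;
    private final List<UnaryOperator<List<String>>> filters;

    @SafeVarargs
    SimpleAnalyzer(Function<String, List<String>> tokenizer,
                   UnaryOperator<List<String>>... filters) {
        this.tokenizer = tokenizer;
        this.filters = Arrays.asList(filters);
    }

    List<String> analyze(String text) {
        List<String> tokens = tokenizer.apply(text);
        // Each filter sees the output of the previous one.
        for (UnaryOperator<List<String>> filter : filters) {
            tokens = filter.apply(tokens);
        }
        return tokens;
    }

    // Tokenizer: split on runs of whitespace.
    static List<String> whitespace(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Filter: lowercase every token.
    static List<String> lowercase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Filter: a deliberately naive plural stripper (drops one trailing "s", but not "ss").
    static List<String> stripTrailingS(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            boolean plural = t.length() > 3 && t.endsWith("s") && !t.endsWith("ss");
            out.add(plural ? t.substring(0, t.length() - 1) : t);
        }
        return out;
    }

    public static void main(String[] args) {
        SimpleAnalyzer analyzer = new SimpleAnalyzer(
                SimpleAnalyzer::whitespace,
                SimpleAnalyzer::lowercase,
                SimpleAnalyzer::stripTrailingS);
        System.out.println(analyzer.analyze("Two Cats and a Glass"));
    }
}
```

In this picture, a pre-built analyzer is simply a chain that someone else has already assembled, which is why its choice of stemmer matters so much to users who never build their own.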

linsms · May 2, 2011, 12:18pm

Hi,

I'm a newbie with Lucene (I'm using 3.1) and I'm trying to use SpanishAnalyzer
to run a query, but I get unexpected results: terms like "despues" or
"ciempies" get cut in the query (body:despu and body:ciempi).

From your explanation I understand that I can disable the plural stripping, but
I don't know how to do it.

Could you help me?

Thanks in advance.

otisg · May 5, 2011, 3:29pm

linsms,

I think Robert just addressed this a day earlier in this same thread:
http://groups.google.com/a/elasticsearch.com/group/users/browse_thread/thread/97eb258f5fbbdc69

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

rmuir · May 5, 2011, 3:37pm

Thanks Otis. As for elasticsearch, now that it's on 3.1, one easy win
might be to expose factories for some of these filters (if they aren't
exposed already).
In combination with factories for the *LightStemFilter and
*MinimalStemFilter classes, I would also recommend exposing the new
StemmerOverrideFilter and KeywordMarkerFilter.

This way, users can pick less aggressive algorithms and then tune any
exceptions to fit.

kimchy · May 5, 2011, 6:05pm

Heya Robert,

Thanks! Yes, I should expose those as well as built-in options. Here is the issue: "Analysis: Expose light and minimal language token filters" (Issue #908, elastic/elasticsearch on GitHub).

-shay.banon

Lukas_Vlcek1 · May 5, 2011, 6:40pm

This will be really useful, can't wait to see it available.

tfreitas · May 5, 2011, 9:40pm

Hi Shay

In this code:
https://github.com/elasticsearch/elasticsearch/blob/master/modules/elasticsearch/src/main/java/org/elasticsearch/index/analysis/SpanishAnalyzerProvider.java

@Inject public SpanishAnalyzerProvider(Index index, @IndexSettings Settings indexSettings,
                                       @Assisted String name, @Assisted Settings settings) {
    super(index, indexSettings, name, settings);
    analyzer = new SpanishAnalyzer(version,
            Analysis.parseStopWords(settings, ArabicAnalyzer.getDefaultStopSet()),
            Analysis.parseStemExclusion(settings, CharArraySet.EMPTY_SET));
}

Is ArabicAnalyzer.getDefaultStopSet() OK here?

kimchy · May 5, 2011, 9:47pm

Ha, saw it as well while trying to add custom "handlers" for default language stopwords to the stop filter. Pushed a fix.
