Protwords support?


(Ivan Brusic) #1

Is there any support for Solr's protwords (stemming overrides)? If
not, it should be doable to modify the existing stemmer filter to accept
a list of protected words. I have never looked at Solr's implementation.

--
Ivan


(Shay Banon) #2

Which analyzer are you after? Most stemming-based ones support
the stem_exclusion setting.
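
For illustration, a minimal sketch of that setting on the english language analyzer. The analyzer name and the word list are made up for the example, so treat them as assumptions rather than anything from this thread:

```yaml
index :
    analysis :
        analyzer :
            english_protected :          # hypothetical analyzer name
                type : english
                # words listed here are passed through without being stemmed
                stem_exclusion : [skies, organization]   # example words only
```

A field mapped to this analyzer would then index the listed words unchanged while still stemming everything else.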

On Thu, Apr 26, 2012 at 8:23 PM, Ivan Brusic ivan@brusic.com wrote:

Is there any support for Solr's protwords (stemming overrides)? If
not, should be doable to modify the existing stemmer filter to provide
a list of words. Never looked up the Solr's implementation.

--
Ivan


(Ivan Brusic) #3

Thanks! Never seen the stem_exclusion property. Currently looking to
translate an existing custom Lucene analyzer to ES.

Do only the language analyzers support stem_exclusion? It appears so.
It would be nice to use a stemmer filter directly with exclusions.
File-based, similar to stop words, would also be ideal. I can provide
a patch if it makes sense.

Cheers,

Ivan

On Fri, Apr 27, 2012 at 3:28 AM, Shay Banon kimchy@gmail.com wrote:

Which analyzer are you after? Most stemming based ones support
stem_exclusion setting.



(Shay Banon) #4

The way it works is that when a stem exclusion is provided,
a KeywordMarkerFilter is added with the provided list, and
the SnowballFilter then ignores any token marked as a keyword. We can add support
for KeywordMarkerFilter to allow custom exclusions, but it will only
make sense with a SnowballFilter after it (which is what the
stemmer filter creates in ES).
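
The chain described above could be sketched as a custom analyzer roughly like the one below. This assumes a keyword_marker token filter of the kind proposed here is available (later Elasticsearch releases ship one under that name); the filter names, analyzer name, and word list are hypothetical:

```yaml
index :
    analysis :
        filter :
            protected_words :            # hypothetical filter name
                type : keyword_marker
                # lowercase entries, because lowercase runs earlier in the chain
                keywords : [skies, organization]   # example words only
            english_snowball :           # hypothetical filter name
                type : snowball
                language : English
        analyzer :
            stem_with_exclusions :       # hypothetical analyzer name
                type : custom
                tokenizer : standard
                # order matters: tokens must be marked before the stemmer sees them
                filter : [lowercase, protected_words, english_snowball]
```

If the two filters were swapped, the stemmer would run on unmarked tokens and the exclusion list would have no effect.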

On Fri, Apr 27, 2012 at 7:52 PM, Ivan Brusic ivan@brusic.com wrote:

Thanks! Never seen the stem_exclusion property. Currently looking to
translate an existing custom Lucene analyzer to ES.

Do only the language analyzers support stem_exclusion? It appears so.
It would be nice to use a stemmer filter directly with exclusions.
File-based, similar to stop words, would also be ideal. I can provide
a patch if it makes sense.

Cheers,

Ivan



(Ivan Brusic) #5

Finally getting a chance to revisit this issue.

From what I can tell, it is not possible to modify an existing
analyzer, correct? Something like:

index :
    analysis :
        analyzer :
            foobar :
                type : english
                tokenizer : mytokenizer
        tokenizer :
            mytokenizer : ...

In that case, I would need to create a custom analyzer with a stemmer
token filter with exclusions. Should I open an issue?

Cheers,

Ivan

On Sun, Apr 29, 2012 at 9:54 AM, Shay Banon kimchy@gmail.com wrote:

The way it works is that when a stem exclusion is provided,
a KeywordMarkerFilter is added with the provided list, and
the SnowballFilter then ignores any token marked as a keyword. We can add support
for KeywordMarkerFilter to allow custom exclusions, but it will only make
sense with a SnowballFilter after it (which is what the stemmer
filter creates in ES).



(Shay Banon) #6

When you set a tokenizer, you are effectively defining your own analyzer, so
setting "type" to english is not really meaningful. I'm not quite sure what
you are after...

On Tue, Jun 5, 2012 at 1:11 AM, Ivan Brusic ivan@brusic.com wrote:

Finally getting a chance to revisit this issue.

From what I can tell, it is not possible to modify an existing
analyzer, correct? Something like:

index :
    analysis :
        analyzer :
            foobar :
                type : english
                tokenizer : mytokenizer
        tokenizer :
            mytokenizer : ...

In that case, I would need to create a custom analyzer with a stemmer
token filter with exclusions. Should I open an issue?

Cheers,

Ivan



(Ivan Brusic) #7

Sorry for being obtuse. The ultimate goal is to have stem_exclusions
on custom analyzers. Currently, this feature is only supported by
specific language analyzers.

Your previous suggestion for having a KeywordMarkerFilter with custom
exclusions is a good one. Yes, its placement in the filter list is
important, but most of Lucene currently works like this anyway.
Reverse the order of the filters, and all of a sudden the code doesn't work!
:)

My later point about creating analyzers which are not custom was
merely a question of syntax. It seems trivial to simply "add on" to an
existing analyzer, but it was unclear whether or not it would work. I
guess the answer is that it does not.

Cheers,

Ivan

On Fri, Jun 8, 2012 at 3:10 PM, Shay Banon kimchy@gmail.com wrote:

When you set a tokenizer, you effectively set your own analyzer, so "type"
set to english is not really meaningful. I lost you a bit with what you are
after...



(Shay Banon) #8

Yeah, you can't do an "add-on" for an existing analyzer; you need to fully
define a custom one. Regarding the filter, sounds good...
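
As a rough sketch, a fully defined version of the foobar analyzer from earlier in the thread might look like the following. The filter chain is an assumption that only approximates what the english analyzer does, the english_stemmer name is hypothetical, and the mytokenizer definition stays elided as in the original post:

```yaml
index :
    analysis :
        analyzer :
            foobar :
                # "custom" instead of "english": every component is listed explicitly
                type : custom
                tokenizer : mytokenizer
                # assumed chain, roughly mimicking the english analyzer
                filter : [lowercase, stop, english_stemmer]
        tokenizer :
            mytokenizer : ...            # definition elided in the original post
        filter :
            english_stemmer :            # hypothetical filter name
                type : stemmer
                language : english
```

Nothing is inherited from the built-in analyzer here, so any behavior that is wanted from it has to be listed in the filter chain by hand.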

On Mon, Jun 11, 2012 at 8:06 PM, Ivan Brusic ivan@brusic.com wrote:

Sorry for being obtuse. The ultimate goal is to have stem_exclusions
on custom analyzers. Currently, this feature is only supported by
specific language analyzers.

Your previous suggestion for having a KeywordMarkerFilter with custom
exclusions is a good one. Yes, its placement in the filter list is
important, but most of Lucene currently works like this anyway.
Reverse the order of the filters, and all of a sudden the code doesn't work!
:)

My later point about creating analyzers which are not custom was
merely a question of syntax. It seems trivial to simply "add on" to an
existing analyzer, but it was unclear whether or not it would work. I
guess the answer is that it does not.

Cheers,

Ivan


