Strange behaviour with html_strip and stemmer filter combination


(vineeth mohan) #1

Hi,

I am using this script to create a custom analyzer that applies a stemmer
and html_strip.

Script: https://gist.github.com/3244856

In the result, however, "thi" shows up as a token (apparently derived from
the word "this").

Output: https://gist.github.com/3244869

This does not happen when html_strip is not used; in that case the token
"this" is removed as expected.

Thanks,
Vineeth


(Igor Motov) #2

Vineeth,

In the case of the "content" analyzer, your text goes through the stemmer
filter before it reaches the stop word filter. When the token "this" goes
through the stemmer filter, it is converted into "thi". The token "thi" is
not a stop word, so the stop word filter passes it through. It might make
sense to put the stop word filter before the stemmer filter to avoid this
problem:

                "filter" : ["lowercase", "stop", "stemmer"],

Igor
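
To make the ordering effect concrete, here is a minimal Python sketch. It is not Elasticsearch code: the stop list is an assumed, illustrative one, `toy_stem` applies only the plural-stripping rules (step 1a) of the Porter algorithm, and the whitespace tokenizer and the `analyze` helper are hypothetical stand-ins for the real analysis chain.

```python
# Toy stand-ins for the token filters discussed above; not Elasticsearch code.
STOP_WORDS = {"this", "is", "a", "the"}  # assumed, illustrative stop list


def toy_stem(token):
    """Apply only the plural-stripping rules (step 1a) of the Porter stemmer."""
    if len(token) <= 2:          # very short tokens are left untouched
        return token
    if token.endswith("sses"):   # e.g. "caresses" -> "caress"
        return token[:-2]
    if token.endswith("ies"):    # e.g. "ponies" -> "poni"
        return token[:-2]
    if token.endswith("ss"):     # e.g. "caress" stays as it is
        return token
    if token.endswith("s"):      # e.g. "this" -> "thi"
        return token[:-1]
    return token


def lowercase(tokens):
    return [t.lower() for t in tokens]


def stop(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def stemmer(tokens):
    return [toy_stem(t) for t in tokens]


def analyze(text, filters):
    """Split on whitespace, then run each filter over the tokens in order."""
    tokens = text.split()
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


# Stemmer before stop: "this" becomes "thi", which the stop list does not match.
stem_first = analyze("This is a test", [lowercase, stemmer, stop])
# Stop before stemmer: "this" is dropped before it can be stemmed.
stop_first = analyze("This is a test", [lowercase, stop, stemmer])

print(stem_first)  # ['thi', 'test']
print(stop_first)  # ['test']
```

With the stemmer first, the stop filter never sees the original word "this", only its stem, which is why the token survives.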



(vineeth mohan) #3

Hi,

From our chat, it turns out that the default stemmer is the Porter stemmer,
and it does this because I haven't specified a language.

Thanks,
Vineeth



(system) #4