Strange behaviour with html_strip and stemmer filter combination

vineeth_mohan · August 3, 2012, 5:51am

Hi,

I am using this script to create a custom analyzer which applies a stemmer
and html_strip.

SCRIPT - https://gist.github.com/3244856

But in the result, the word "thi" shows up as a token (it is somehow
derived from the word "this").

Output - https://gist.github.com/3244869

When I don't use html_strip, this does not happen; the token "this" is
removed as expected.

Thanks
Vineeth

Igor_Motov · August 4, 2012, 1:23am

Vineeth,

In case of the "content" analyzer, your text goes through the stemmer
filter before it reaches the stop word filter. When the token "this" goes
through the stemmer filter it is converted into "thi". The token "thi" is
not a stop word and therefore the stop word filter passes it through. It
might make sense to put the stop word filter before the stemmer filter to
avoid this problem:

                "filter" : ["lowercase", "stop", "stemmer"],
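To make the ordering effect concrete, here is a minimal Python sketch of the two filter chains. It is not Elasticsearch code: the stop word list, the `toy_stemmer` function (which only strips a trailing "s" from words longer than two characters, as a rough stand-in for the relevant Porter step) and the `run_chain` helper are all assumptions made up for illustration.

```python
# Toy stand-ins for the token filters; names and word list are hypothetical.
STOP_WORDS = {"this", "is", "a", "the"}


def lowercase(tokens):
    return [t.lower() for t in tokens]


def stop(tokens):
    # Drop tokens that exactly match a stop word.
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stemmer(tokens):
    # Simplified stemming: strip one trailing "s" (but not "ss")
    # from words longer than two characters, so "this" becomes "thi".
    return [
        t[:-1] if len(t) > 2 and t.endswith("s") and not t.endswith("ss") else t
        for t in tokens
    ]


def run_chain(tokens, filters):
    # Apply each filter in order, like the "filter" array of an analyzer.
    for f in filters:
        tokens = f(tokens)
    return tokens


tokens = ["This", "is", "a", "test"]

# Stemmer before stop: "this" is already "thi" when the stop filter sees it.
stem_then_stop = run_chain(tokens, [lowercase, toy_stemmer, stop])

# Stop before stemmer: "this" is removed before it can be stemmed.
stop_then_stem = run_chain(tokens, [lowercase, stop, toy_stemmer])

print(stem_then_stop)
print(stop_then_stem)
```

With the stemmer first, "thi" survives because it no longer matches the stop word "this"; with the stop filter first, only "test" is left.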

Igor

vineeth_mohan · August 4, 2012, 1:54am

Hi,

From our chat, it turns out that the default stemmer is the Porter stemmer,
and it performs this operation because I haven't specified a language.

Thanks
Vineeth
