Hi,
I am using this script to create a custom analyzer that applies a stemmer
and html_strip.
SCRIPT - https://gist.github.com/3244856
But in the result, the word "thi" is recognized as a token (it is
somehow derived from the word "this").
Output - https://gist.github.com/3244869
When I do not use html_strip, this does not happen: the token "this" is
removed as expected.
Thanks
Vineeth
Vineeth,
In the case of the "content" analyzer, your text goes through the stemmer
filter before it reaches the stop word filter. When the token "this" goes
through the stemmer filter, it is converted into "thi". The token "thi" is
not a stop word, so the stop word filter passes it through. It
might make sense to put the stop word filter before the stemmer filter to
avoid this problem:
"filter" : ["lowercase", "stop", "stemmer"],
Igor
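The ordering effect described above can be sketched with a toy model in Python. This is only an illustration, not Elasticsearch code: the stop list, the whitespace tokenizer, and the `toy_stem` rule (drop one trailing "s") are assumptions chosen to reproduce the "this" to "thi" behaviour, and they are not the real Porter stemmer or stop filter.

```python
# Toy model of an analysis chain: each token filter maps a token list to a token list.
# STOP_WORDS and toy_stem are hypothetical stand-ins, not Elasticsearch's implementations.

STOP_WORDS = {"this", "is", "a", "the"}  # small assumed stop list


def lowercase(tokens):
    return [t.lower() for t in tokens]


def stop(tokens):
    # Drops tokens that exactly match a stop word.
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(tokens):
    # Simplified rule: for tokens longer than two characters, drop one trailing "s"
    # unless the token ends in "ss". This mimics only the step that turns "this" into "thi".
    return [
        t[:-1] if len(t) > 2 and t.endswith("s") and not t.endswith("ss") else t
        for t in tokens
    ]


def analyze(text, filters):
    tokens = text.split()  # whitespace tokenizer, for simplicity
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


text = "This is a test"
stem_then_stop = analyze(text, [lowercase, toy_stem, stop])
stop_then_stem = analyze(text, [lowercase, stop, toy_stem])
print(stem_then_stop)  # ['thi', 'test']  ("thi" no longer matches the stop word "this")
print(stop_then_stem)  # ['test']         ("this" is removed before stemming)
```

With the stemmer first, "this" has already become "thi" by the time the stop filter runs, so it slips through; with the stop filter first, it is removed while it still matches the stop list.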
(In reply to Vineeth Mohan's message of Friday, August 3, 2012 1:51:27 AM UTC-4.)
Hi,
From our chat, it turns out that the default stemmer is the Porter stemmer,
and it performs this operation because I haven't specified a language.
Thanks
Vineeth
(In reply to Igor Motov's message of Sat, Aug 4, 2012 at 6:53 AM.)