Baseform plugin not working for me


(vineeth mohan-2) #1

Hello,

I have been trying to get the baseform plugin to work, but it is not working
for me.

I tried it with the _analyze API endpoint, but rather than returning both
variants of the word, it returns the same word twice.

For example:

curl -XGET 'localhost:9200/xyz/_analyze?tokenizer=letter&filters=baseform&pretty' -d 'sweltering'
{
  "tokens" : [ {
    "token" : "sweltering",
    "start_offset" : 0,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "sweltering",
    "start_offset" : 0,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  } ]
}

Here I was expecting sweltering to be reduced to swelter, but sweltering
appears twice and the base form is missing.
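
A quick way to check this programmatically is to parse the _analyze response and compare the distinct tokens against the expected base form. The following is a minimal sketch, not part of the plugin: the helper name `summarize_tokens` is made up for illustration, and the response text is copied from the output above.

```python
import json

# Response body copied from the _analyze call shown above.
response_text = """
{
  "tokens" : [ {
    "token" : "sweltering",
    "start_offset" : 0,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "sweltering",
    "start_offset" : 0,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  } ]
}
"""


def summarize_tokens(response, expected_baseform):
    """Hypothetical helper: list the emitted tokens and check for the base form."""
    tokens = [entry["token"] for entry in response["tokens"]]
    distinct = sorted(set(tokens))
    return {
        "tokens": tokens,
        "distinct": distinct,
        "has_baseform": expected_baseform in distinct,
    }


summary = summarize_tokens(json.loads(response_text), "swelter")
print(summary)
```

For the response above, `has_baseform` is False and only one distinct token is present, which matches the problem described.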

I tried this on both the 0.90 and 1+ versions of Elasticsearch, and I am seeing
the same wrong output.

Is there anything wrong with how I have set up the plugin, or is it an issue on
the plugin side?

Thanks
Vineeth

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAGdPd5m4S2UydYTEJeiAp75se%3DK1OB6RQvs1sZRnfaq6NmfGhA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


(Jörg Prante) #2

I assume you have correctly set the language to English, as in the example.

The baseform plugin is based on training data for the English language, so it is
possible that "sweltering" is not recognized.

You can add missing words to the training data file in the plugin source

https://github.com/jprante/elasticsearch-analysis-baseform/blob/master/src/main/resources/en-lemma-utf8.txt

and recompile. Patches are welcome!
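
Before patching the file, it can help to check which words are actually missing from it. The sketch below is only an illustration under assumptions: it assumes each line holds a lemma and an inflected form separated by a tab, in that order, which should be verified against the real file first. The function name `load_lemma_pairs` and the sample lines are made up and are not taken from the plugin.

```python
def load_lemma_pairs(lines, sep="\t"):
    """Build a form -> lemma mapping.

    ASSUMPTION: each line is "<lemma><sep><form>". Check the real
    en-lemma-utf8.txt before relying on this column order.
    """
    forms = {}
    for line in lines:
        line = line.strip()
        if not line or sep not in line:
            continue  # skip blank or malformed lines
        lemma, form = line.split(sep, 1)
        forms[form.strip().lower()] = lemma.strip().lower()
    return forms


# Made-up sample lines in the assumed layout, not real plugin data.
sample_lines = ["sweet\tsweets", "run\trunning", ""]
forms = load_lemma_pairs(sample_lines)

words_to_check = ["sweltering", "sweets"]
missing = [word for word in words_to_check if word not in forms]
print(missing)
```

To run it against the real data, pass an open file handle, for example `open("en-lemma-utf8.txt", encoding="utf-8")`, in place of `sample_lines`.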

Jörg

(In reply to vineeth mohan's message of Wed, Mar 12, 2014 at 2:33 PM.)


(vineeth mohan-2) #3

Hello Joerg,

I took an example from the txt file you pointed to, and I am still seeing the
same behavior. Kindly check:

curl -XGET 'localhost:9200/relations/_analyze?tokenizer=letter&filters=baseform&pretty' -d 'sweets'
{
  "tokens" : [ {
    "token" : "sweets",
    "start_offset" : 0,
    "end_offset" : 6,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "sweets",
    "start_offset" : 0,
    "end_offset" : 6,
    "type" : "word",
    "position" : 1
  } ]
}

Thanks
Vineeth

(In reply to Jörg Prante's message of Wed, Mar 12, 2014 at 11:37 PM.)

