Hello,
I have been trying to get the baseform plugin to work, but it is not working
for me.
I tried it with the _analyze API endpoint, but rather than returning both
variants of the word, it returns the same word twice.
For example:
curl -XGET 'localhost:9200/xyz/_analyze?tokenizer=letter&filters=baseform&pretty' -d 'sweltering'
{
"tokens" : [ {
"token" : "sweltering",
"start_offset" : 0,
"end_offset" : 10,
"type" : "word",
"position" : 1
}, {
"token" : "sweltering",
"start_offset" : 0,
"end_offset" : 10,
"type" : "word",
"position" : 1
} ]
}
Here I was expecting sweltering to be reduced to swelter, but sweltering
appears twice and the base form is missing.
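To make the symptom easy to check, here is a minimal sketch that lists the distinct token values in an _analyze response. The function name `distinct_tokens` is hypothetical, and the pattern assumes the pretty-printed response layout shown above.

```shell
# Hypothetical helper: read an _analyze response on stdin and print each
# distinct token value once. Assumes the pretty-printed layout shown above,
# with entries of the form "token" : "value".
distinct_tokens() {
  grep -o '"token" *: *"[^"]*"' | sed 's/.*: *"\(.*\)"/\1/' | sort -u
}

# Usage (needs a running node, so it is left commented out):
# curl -s -XGET 'localhost:9200/xyz/_analyze?tokenizer=letter&filters=baseform&pretty' \
#   -d 'sweltering' | distinct_tokens
```

If only the input word is printed, as with the response above, the filter emitted no separate base form.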
I tried this on both the 0.90 and 1+ versions of Elasticsearch, and I am seeing
the same wrong output in both.
Is there anything wrong with how I have set up the plugin, or is it an issue on
the plugin side?
Thanks
Vineeth
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com .
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAGdPd5m4S2UydYTEJeiAp75se%3DK1OB6RQvs1sZRnfaq6NmfGhA%40mail.gmail.com .
For more options, visit https://groups.google.com/d/optout .
jprante
(Jörg Prante)
March 12, 2014, 6:07pm
2
I assume you have correctly set the English language, as in the example.
The baseform plugin is based on training data for the English language, so it is
possible that "sweltering" is not recognized.
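For reference, the language setting mentioned above might look roughly like the sketch below. This is an assumption, not a confirmed configuration: the filter type `baseform` and its `language` parameter are taken from memory of the plugin README, and the names `baseform_en` and `baseform_test` are hypothetical. Check everything against the plugin version you installed.

```shell
# Print index settings that define a baseform filter for English.
# ASSUMPTION: filter type "baseform" and parameter "language" follow the
# plugin README; "baseform_en" is a hypothetical name chosen for this sketch.
baseform_settings() {
  cat <<'EOF'
{
  "settings": {
    "analysis": {
      "filter": {
        "baseform_en": { "type": "baseform", "language": "en" }
      },
      "analyzer": {
        "baseform_en": {
          "type": "custom",
          "tokenizer": "letter",
          "filter": ["lowercase", "baseform_en"]
        }
      }
    }
  }
}
EOF
}

# Usage (needs a running node; "baseform_test" is a hypothetical index name):
# baseform_settings | curl -XPUT 'localhost:9200/baseform_test' -d @-
# curl -XGET 'localhost:9200/baseform_test/_analyze?analyzer=baseform_en&pretty' -d 'sweltering'
```

Defining the filter in the index settings makes the language explicit, instead of relying on whatever default a bare `filters=baseform` request picks up.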
You can add missing words to the training data file in the plugin source
https://github.com/jprante/elasticsearch-analysis-baseform/blob/master/src/main/resources/en-lemma-utf8.txt
and recompile. Patches are welcome!
Jörg
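Before adding a word to the training data, it may help to confirm whether it is already there. A minimal sketch, assuming a local checkout of the plugin source; the function name `has_word` is hypothetical, and the check deliberately does not depend on the file's column layout.

```shell
# Hypothetical helper: succeed if WORD occurs as a whole word anywhere in FILE.
# It makes no assumption about the column layout of the lemma file.
has_word() {
  grep -q -w -- "$1" "$2"
}

# Usage, run from the root of a local checkout of the plugin source:
# has_word sweltering src/main/resources/en-lemma-utf8.txt && echo "present" || echo "missing"
```

If the word is present and the filter still returns only the input token, the cause is more likely the filter configuration than the training data.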
Hello Joerg,
I have taken an example from the txt file you pointed to. I am still
seeing the same result.
Kindly check:
curl -XGET 'localhost:9200/relations/_analyze?tokenizer=letter&filters=baseform&pretty' -d 'sweets'
{
"tokens" : [ {
"token" : "sweets",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 1
}, {
"token" : "sweets",
"start_offset" : 0,
"end_offset" : 6,
"type" : "word",
"position" : 1
} ]
}
Thanks
Vineeth