What does it take to make a custom stemmer for ES?


(Nandiya Bhikkhu) #1

I am interested in using Elasticsearch for our website, suttacentral.net.
I've tried ES and found it pleasant to use, with obvious power. The only
challenge is that on suttacentral we host many Buddhist texts in ancient
languages, particularly Pali, and suffice it to say there are no existing
stemmers for them. Stemming is a vital step for searching because Pali is a
highly inflected language (like Latin). The actual stemming step is
straightforward enough: at present we use a custom stemmer I wrote in
Python. It's dead simple, and I wouldn't have much trouble implementing the
same code in Java (i.e. as a function which takes an inflected word as a
string and returns the stem as another string). Where I'm in the dark is
making ES call that code.

All the example stemmer plugins I've found adapt existing stemmers to ES.
What I really want is just a way to call a function on each token and use
the return value of that function. It seems to me that this should be
simple enough, but I've not managed to find any minimal code to use as a
template. Although it would be noble, at this point I'm not interested in
making a proper plugin; I would be happy with the barest bodge/hack that
would achieve the desired effect!

If anyone could point me in the right direction, either to a minimal code
example or an outline of what it would involve, I would be very grateful.

Kind regards,
Nandiya Bhikkhu

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/fe2c777e-b823-4652-8f6c-ecf42ec36d33%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Otis Gospodnetić) #2

Hi Nandiya,

Have a look at Lucene and its source code for token filters. You'd
implement a custom stemmer at the Lucene level, and then just use that in ES.
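
For illustration only, here is a minimal sketch of the kind of per-token
function the question describes: a string in, a stem out. The class name
SuffixStemmer, the suffix list, and the minimum stem length are all
assumptions made up for this example; the list is nowhere near a complete or
vetted Pali suffix table, and this is not the stemmer used on suttacentral.

```java
import java.util.Locale;

// Hypothetical helper: strips the first matching suffix from a token.
final class SuffixStemmer {
    // Illustrative endings only, ordered longest first so that the longest
    // match wins. This is NOT a real or complete Pali suffix table.
    private static final String[] SUFFIXES = {"assa", "ena", "o", "a"};

    // Assumed lower bound on stem length, to avoid over-stemming short words.
    private static final int MIN_STEM_LENGTH = 3;

    private SuffixStemmer() {
        // Static utility class; not meant to be instantiated.
    }

    // Takes an inflected word and returns its (lowercased) stem.
    static String stem(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            int stemLength = lower.length() - suffix.length();
            if (lower.endsWith(suffix) && stemLength >= MIN_STEM_LENGTH) {
                return lower.substring(0, stemLength);
            }
        }
        // No suffix applied: return the lowercased word unchanged.
        return lower;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"dhammassa", "dhammena", "Dhammo", "so"}) {
            System.out.println(word + " -> " + stem(word));
        }
    }
}
```

In Lucene terms, a function like this would typically be called from the
incrementToken() method of a TokenFilter subclass, which reads the current
term from the token's CharTermAttribute, passes it to stem(), and writes the
result back. How that filter is then registered with ES depends on the ES
version and is not shown here.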

Otis

Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Monday, July 7, 2014 8:57:09 PM UTC-4, Nandiya Bhikkhu wrote:

[...]

