I am interested in using Elasticsearch for our website, suttacentral.net.
I've tried ES and found it pleasant to use and obviously powerful. The only
challenge is that on SuttaCentral we host many Buddhist texts in ancient
languages, particularly Pali, and suffice it to say there are no existing
stemmers for it. Stemming is a vital step for searching because Pali is a
highly inflected language (like Latin). The actual stemming step is
straightforward enough: at present we use a custom stemmer I wrote in
Python. It's dead simple, and I wouldn't have much trouble implementing the
same code in Java (i.e. as a function which takes an inflected word as a
string and returns the stem as another string). Where I'm in the dark is
making ES call that code.
All the example stemmer plugins I've found adapt existing stemmers to ES.
What I really want is just a way to call a function on each token and use
the return value of that function. It seems to me that this should be
simple enough, but I've not managed to find any simple, minimalistic code
to use as a template. Although it would be noble, at this point I'm not
interested in making a proper plugin; I would be happy with the barest
bodge/hack that would achieve the desired effect!
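To make the shape of the problem concrete, here is a rough sketch in plain
Java of the kind of function I mean, plus a stand-in for the "call it on
each token" step. The class name, the method names and the suffix list are
all made up for illustration (the suffixes are placeholders, not our real
Pali rules). As far as I understand, the per-token hook on the Lucene side
would be a TokenFilter subclass whose incrementToken() rewrites the term
held in a CharTermAttribute, but that wiring is exactly the part I'm unsure
about, so it is left out here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Illustrative sketch only: names and suffixes are hypothetical.
final class PaliStemSketch {

    // Placeholder suffixes, longest first. NOT a real Pali suffix table.
    private static final String[] SUFFIXES = {"assa", "ena", "o"};

    // Never strip a suffix if fewer than this many characters would remain.
    private static final int MIN_STEM_LENGTH = 3;

    // Takes an inflected word, returns the stem as another string.
    static String stem(String word) {
        for (String suffix : SUFFIXES) {
            int stemLength = word.length() - suffix.length();
            if (word.endsWith(suffix) && stemLength >= MIN_STEM_LENGTH) {
                return word.substring(0, stemLength);
            }
        }
        return word; // no suffix matched: leave the token unchanged
    }

    // Stand-in for the analysis chain: apply a function to each token.
    static List<String> mapTokens(List<String> tokens, UnaryOperator<String> fn) {
        List<String> out = new ArrayList<>(tokens.size());
        for (String token : tokens) {
            out.add(fn.apply(token));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("dhammassa", "dhammena", "dhammo");
        System.out.println(mapTokens(tokens, PaliStemSketch::stem));
    }
}
```

All I'm really after is whatever minimal glue would let ES call something
like stem() in place of the mapTokens() loop above.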
If anyone could point me in the right direction, either to a minimalistic
code example or an outline of what it would involve, I would be very
grateful.
Posted by Nandiya Bhikkhu on Monday, July 7, 2014, 8:57:09 PM UTC-4.