Because all of you are eager to keep up with Lucene 3.5 features, I just
wrote an ElasticSearch Hunspell Analysis plugin.
Project URL:
For discussion, see
Please note: included are hunspell dict/aff files from Chromium for
convenience. The license for the third-party files is a tri-license MPL
1.1/GPL 2.0/LGPL.
This feature looks great.
Correct me if I misunderstood: does your plugin correct spelling errors
before indexing?
And when a user runs a search, are the terms modified (spell-checked)
before the search runs?
I forgot to mention, right now, hunspell is used in Lucene as a token
filter for stemming.
"TokenFilter that uses hunspell affix rules and words to stem tokens.
Since hunspell supports a word having multiple stems, this filter can
emit multiple tokens for each consumed token."
From a linguistic point of view, hunspell stemming is an improvement
over snowball stemming.
"Hunspell provides stemming for all languages that have OpenOffice
spellcheck dictionaries. Being dictionary based, it requires high
quality and well maintained dictionaries to work well for stemming -
in which case it may give more precise stemming than the Snowball
algorithms."
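To make the "multiple stems per token" point concrete, here is a minimal
toy sketch in Python. It is not Lucene's filter and not the real hunspell
algorithm; the dictionary, the flags and the suffix rules are all invented
for illustration. It only shows why a dictionary-plus-affix approach can
return more than one stem for a single token.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical miniature dictionary: word -> affix flags the word accepts.
DICTIONARY: Dict[str, Set[str]] = {
    "bus": {"A"},
    "buss": {"B"},
    "walk": {"C"},
}

# Hypothetical suffix rules: flag -> (suffix to strip, text to put back).
SUFFIX_RULES: Dict[str, Tuple[str, str]] = {
    "A": ("ses", ""),
    "B": ("es", ""),
    "C": ("ed", ""),
}


def stems(token: str) -> List[str]:
    """Return every dictionary word the token can be reduced to."""
    found: List[str] = []
    # A token that is itself a dictionary word is its own stem.
    if token in DICTIONARY:
        found.append(token)
    # Try each suffix rule; keep the candidate only if the dictionary
    # entry for it carries the flag of the rule that produced it.
    for flag, (suffix, replacement) in SUFFIX_RULES.items():
        if token.endswith(suffix):
            candidate = token[: len(token) - len(suffix)] + replacement
            if flag in DICTIONARY.get(candidate, set()) and candidate not in found:
                found.append(candidate)
    return found


if __name__ == "__main__":
    for word in ("busses", "walked", "xyz"):
        print(word, "->", stems(word))
```

In this toy setup "busses" reduces to both "bus" and "buss", so a token
filter built this way would have to emit two tokens for one input token.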
A Big Thanks to the Chromium folks at Google. They selected hunspell
dicts from Open Office and from all over the place. Unfortunately I
had to skip the dic_delta / bdic effort which Chromium uses for dict
enhancement, for details see README.chromium. Two aff files couldn't
get parsed by Lucene's HunspellDictionary because of missing SET tags.
But I was able to fix that. At least I hope so: the parse exception
went away. I have not done much testing, though.
The idea for future development is to create morpheme lists (finite
state transducer) on the fly while indexing words to ES and prepare
custom dict/aff files for spell check and autosuggestion.
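As a side note on the missing SET tags mentioned above: a hunspell .aff
file normally declares its character encoding with a line of the form
"SET <encoding>". The following is only a sketch of how one might check
for that line before handing a file to a parser. The function name is
made up for this example, and it says nothing about how the two files
were actually fixed.

```python
from typing import Optional


def declared_encoding(aff_text: str) -> Optional[str]:
    """Return the encoding named by the first SET line, or None if absent."""
    for line in aff_text.splitlines():
        # Drop a possible byte-order mark, then split on whitespace.
        parts = line.lstrip("\ufeff").split()
        if len(parts) >= 2 and parts[0] == "SET":
            return parts[1]
    return None


if __name__ == "__main__":
    sample = "# toy affix file\nSET UTF-8\nSFX A Y 1\n"
    print(declared_encoding(sample))
```

A file for which this returns None would be a candidate for the kind of
parse problem described above.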
Great!
That would be great to have. Especially one that is realtime!
For discussion, see https://github.com/elasticsearch/elasticsearch/issues/646
Great job.
But the installation process is not working.
The compiled jar is not available for download from GitHub, so the
plugin cannot be installed on Elasticsearch via the plugin utility.
I gave the hunspell plugin a try and have some doubts about whether it can
really qualify as a stemmer. The problem I see is that it can emit far too
many different options for some terms (especially short ones), which IMO
can seriously harm relevancy. I was testing it with the Czech language,
but I guess the situation is the same for some other languages (based on
my short test, English seems to work a lot better).
I can clearly see the benefit of hunspell as a spelling tool, but as a
stemmer? I am not familiar with the hunspell API, but are there any options
that influence the stemming process and that might be useful to expose in
the ES plugin API as well?
Just to give an illustration, there is a Czech word "rada" which in the
given context means "board" (but it can also mean "advice").
Hunspell with the cs_CZ locale yields the following terms:
rada (board)
rada (the same term, but I guess this time it is meant as "advice")
raď (give advice - a verb)
radon (radon - a noun)
Thanks for pointing this out. Yes, it's true, the use of hunspell for
stemming must be carefully evaluated for each dictionary. See also the
warnings in
I assume the Czech dictionary I found in Chromium is not the best
choice.
To be honest, I am just in the process of learning to write
Elasticsearch plugins, and I started with a very tiny project. Most
attractive was a feature that appeared in Lucene 3.5, the hunspell
stem filter.
In a more advanced dictionary plugin I am busy with, I will use
hunspell dictionaries in the more appropriate way, that is, for spell
suggestions.