Ann: ElasticSearch Hunspell Analysis plugin


(Jörg Prante) #1

Hi,

Because all of you are eager to keep up with Lucene 3.5 features, I have just
written an ElasticSearch Hunspell Analysis plugin.

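Project URL:
https://github.com/jprante/elasticsearch-analysis-hunspell

For discussion, see
https://github.com/elasticsearch/elasticsearch/issues/646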

Please note: hunspell dict/aff files from Chromium are included for
convenience. These third-party files are tri-licensed under MPL
1.1/GPL 2.0/LGPL.

Example usage:

    index:
      analysis:
        filter:
          hunspell_de:
            type: hunspell
            locale: de_DE
            ignoreCase: true
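
For anyone who prefers to pass index settings as JSON instead of YAML, the same filter definition can be built programmatically. The following is only a sketch in Python: the analyzer name and the choice of the `standard` tokenizer and `lowercase` filter are assumptions for illustration, and the plugin does not prescribe them.

```python
import json


def hunspell_settings(locale: str, ignore_case: bool = True) -> dict:
    """Build index settings that mirror the YAML example above."""
    filter_name = "hunspell_" + locale.split("_")[0]  # e.g. "hunspell_de"
    analyzer_name = filter_name + "_analyzer"         # hypothetical name
    return {
        "index": {
            "analysis": {
                "filter": {
                    filter_name: {
                        "type": "hunspell",
                        "locale": locale,
                        "ignoreCase": ignore_case,
                    }
                },
                # Assumed analyzer wiring, shown only to place the filter
                # in a token filter chain.
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", filter_name],
                    }
                },
            }
        }
    }


settings = hunspell_settings("de_DE")
print(json.dumps(settings, indent=2))
```

Putting `lowercase` before the hunspell filter is a design choice of this sketch, not a requirement stated anywhere in this thread.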

Cheers,

Jörg


(David Pilato) #2

Hi Jörg,

This feature looks great.
Correct me if I have misunderstood: does your plugin correct spelling errors
before indexing?
And when a user runs a search, are the terms modified (spell-checked) before
the search runs?

Is that it?

Thanks
David.

On 29 December 2011 at 21:45, "Jörg Prante" joergprante@gmail.com wrote:

--
David Pilato
http://dev.david.pilato.fr/
Twitter : @dadoonet


(Jörg Prante) #3

Hi David,

I forgot to mention: right now, Lucene uses hunspell as a token
filter for stemming.

"TokenFilter that uses hunspell affix rules and words to stem tokens.
Since hunspell supports a word having multiple stems, this filter can
emit multiple tokens for each consumed token."

http://lucene.apache.org/java/3_5_0/api/contrib-analyzers/org/apache/lucene/analysis/hunspell/HunspellStemFilter.html
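
To make the "multiple stems" behaviour concrete, here is a toy sketch in Python. It is not Lucene's HunspellStemFilter and it does not read real dict/aff files; the word list and suffix rules are invented for illustration. It applies each matching suffix rule and keeps every candidate that is found in the dictionary.

```python
from typing import List

# Toy dictionary-based stemmer: NOT the Lucene implementation.
# The dictionary and suffix rules below are invented for illustration.
DICTIONARY = {"sing", "singe", "walk"}

# Each rule: (suffix to strip, text to append after stripping)
SUFFIX_RULES = [("s", ""), ("ing", ""), ("ing", "e"), ("ed", "")]


def stems(token: str) -> List[str]:
    """Return every dictionary word reachable from token by one suffix rule."""
    found: List[str] = []
    if token in DICTIONARY:
        found.append(token)
    for strip, add in SUFFIX_RULES:
        if token.endswith(strip):
            candidate = token[: len(token) - len(strip)] + add
            if candidate in DICTIONARY and candidate not in found:
                found.append(candidate)
    return found


print(stems("singing"))  # ['sing', 'singe']
print(stems("walked"))   # ['walk']
```

A filter built this way would emit both `sing` and `singe` for the single input token `singing`, which is the situation the quoted Javadoc describes.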

From a linguistic point of view, dictionary-based hunspell stemming can be
an improvement over algorithmic Snowball stemming.

"Hunspell provides stemming for all languages that have OpenOffice
spellcheck dictionaries. Being dictionary based, it requires high
quality and well maintained dictionaries to work well for stemming -
in which case it may give more precise stemming than the Snowball
algorithms."

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Jörg

On Dec 29, 10:17 pm, "da...@pilato.fr" da...@pilato.fr wrote:



(Shay Banon) #4

This looks great, especially with all the built-in dicts! Did you download
those from OpenOffice?

On Fri, Dec 30, 2011 at 9:23 AM, jprante joergprante@gmail.com wrote:



(Jörg Prante) #5

I downloaded them from the Chromium git repository:

git clone http://git.chromium.org/chromium/deps/hunspell_dictionaries.git

Big thanks to the Chromium folks at Google. They selected hunspell
dicts from OpenOffice and from all over the place. Unfortunately, I
had to skip the dic_delta / bdic effort that Chromium uses for dict
enhancement; for details, see README.chromium. Two aff files could not
be parsed by Lucene's HunspellDictionary because of missing SET tags,
but I was able to fix that. At least I hope so: the parse exception
went away. I have not done much testing, though.
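
A quick way to spot affix files with the same problem is to scan them for a SET line before handing them to the parser. The helper below is a sketch in Python, not the fix applied in the plugin; the function name and the sample strings are made up, and it only reports what a file declares, it does not guess an encoding.

```python
from typing import Optional


def declared_encoding(aff_text: str) -> Optional[str]:
    """Return the encoding named on the first SET line, or None if absent."""
    for line in aff_text.splitlines():
        parts = line.split()
        # Skip blank lines and comment lines.
        if not parts or parts[0].startswith("#"):
            continue
        if parts[0] == "SET" and len(parts) >= 2:
            return parts[1]
    return None


# Made-up sample contents, standing in for real .aff files.
with_set = "SET UTF-8\nTRY esianrtolcdugmphbyfvkwz\n"
without_set = "# affix file without encoding\nTRY esianrtolcdugmphbyfvkwz\n"

print(declared_encoding(with_set))     # UTF-8
print(declared_encoding(without_set))  # None
```

Which encoding to declare in a file that lacks the line depends on how that file was actually written, so that step is left out here.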

The idea for future development is to create morpheme lists (finite-state
transducers) on the fly while indexing words into ES, and to prepare
custom dict/aff files for spell checking and autosuggestion.

Jörg

On Dec 30, 10:55 pm, Shay Banon kim...@gmail.com wrote:



(Shay Banon) #6

On Sat, Dec 31, 2011 at 1:22 AM, jprante joergprante@gmail.com wrote:

> I downloaded them from the Chromium git repository:
>
> git clone http://git.chromium.org/chromium/deps/hunspell_dictionaries.git
>
> Big thanks to the Chromium folks at Google. They selected hunspell
> dicts from OpenOffice and from all over the place. Unfortunately, I
> had to skip the dic_delta / bdic effort that Chromium uses for dict
> enhancement; for details, see README.chromium. Two aff files could not
> be parsed by Lucene's HunspellDictionary because of missing SET tags,
> but I was able to fix that. At least I hope so: the parse exception
> went away. I have not done much testing, though.

Great!

> The idea for future development is to create morpheme lists (finite-state
> transducers) on the fly while indexing words into ES, and to prepare
> custom dict/aff files for spell checking and autosuggestion.

That would be great to have, especially one that works in real time!



(Damien Hardy) #7

Hello,

Great job.
However, the installation process is not working.
The compiled jar is not available for download from GitHub, so the
plugin cannot be installed on Elasticsearch via the plugin utility.

Best regards,

--
Damien


(Jörg Prante) #8

Thank you for pointing this out. I uploaded a zip file,
elasticsearch-analysis-hunspell-1.0.0.zip, to the GitHub download area.

Best regards,

Jörg

On Jan 2, 1:49 pm, Damien Hardy damienhardy....@gmail.com wrote:



(Lukáš Vlček) #9

Hi Jörg,

I gave the hunspell plugin a try and have some doubts about whether it can really
qualify as a stemmer. The problem I see is that it can emit so many different
options for some terms (especially short ones) that it can, IMO, seriously harm
relevancy. I was testing it with the Czech language, but I guess the situation is
the same for some other languages as well (based on my short test, English seems
to work a lot better).

I can clearly see the benefit of hunspell as a spelling tool, but as a stemmer? I am
not familiar with the hunspell API, but are there any options that can influence
the stemming process and that might be useful to expose in the ES plugin API as
well?

Regards,
Lukas

On Tue, Jan 3, 2012 at 9:38 AM, jprante joergprante@gmail.com wrote:



(Lukáš Vlček) #10

Just to give an illustration: there is a Czech word "rada", which in the given
context means "board" (but it can also mean "advice").
Hunspell with the cs_CZ locale yields the following terms:

rada (board)
rada (the same term again; I guess this time it is meant as "advice")
raď (give advice - a verb)
radon (radon - a noun)

This really cannot qualify as a stemmer.
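
To show why the extra terms hurt relevancy, here is a toy inverted index in Python. It does not call hunspell: the stem table is hard-coded from the terms listed above, and the entry for "radonu" as well as the two sample documents are assumptions added only so that there is something to search.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Stems listed above for "rada" with cs_CZ; "radonu" -> "radon" is an
# assumed entry so that the second document has a stem to index.
STEMS: Dict[str, List[str]] = {
    "rada": ["rada", "raď", "radon"],
    "radonu": ["radon"],
}


def analyze(token: str) -> List[str]:
    """Return all stems for a token, or the token itself if none are known."""
    return STEMS.get(token, [token])


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Index every stem of every whitespace-separated token."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            for term in analyze(token):
                index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Match a document if it shares any stem with any query token."""
    hits: Set[int] = set()
    for token in query.lower().split():
        for term in analyze(token):
            hits |= index.get(term, set())
    return hits


docs = {1: "městská rada", 2: "koncentrace radonu"}
index = build_index(docs)
print(sorted(search(index, "rada")))  # [1, 2]
```

A search for "rada" also returns the radon document, because both share the emitted term "radon".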

Regards,
Lukas

On Thu, Jan 26, 2012 at 3:39 PM, Lukáš Vlček lukas.vlcek@gmail.com wrote:



(Jörg Prante) #11

Hi Lukáš,

Thanks for pointing this out. Yes, it's true: the use of hunspell for
stemming must be carefully evaluated for each dictionary. See also the
warnings in

http://wiki.apache.org/solr/HunspellStemFilterFactory

Robert Muir also cautioned about this in https://issues.apache.org/jira/browse/SOLR-2769

I assume the Czech dictionary I found in Chromium is not the best
choice.

To be honest, I am just learning to write Elasticsearch plugins, and
I started with a very small project. What attracted me most was a
feature that appeared in Lucene 3.5, the hunspell stem filter.

In a more advanced dictionary plugin I am working on, I will use
hunspell dictionaries in the more appropriate way, that is, for
spelling suggestions.

Best regards,

Jörg

On Jan 26, 3:47 pm, Lukáš Vlček lukas.vl...@gmail.com wrote:



(Lukáš Vlček) #12

Sounds cool, and thanks for the effort!
Lukas

On Thu, Jan 26, 2012 at 5:32 PM, jprante joergprante@gmail.com wrote:


