[ANN] Elasticsearch Analysis Baseform plugin 1.1.0


(Jörg Prante) #1

Hi,

version 1.1.0 of my lemmatization plugin now includes English baseforms.

An example is included in the README.

Credits go to http://languagetool.org/ for providing the dictionary.

More info and download:

https://github.com/jprante/elasticsearch-analysis-baseform

Cheers,

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Lukáš Vlček) #2

Hi Jörg,

nice work!

One question: I did not check the source code, but from the docs it seems
that you allow the use of dict files, for example for English. Dict files
generally allow case-sensitive words, which can be very useful. Is there
any recommendation on how to use your plugin in combination with the
lowercase filter and still get the full power of a case-sensitive dictionary?

For example, the dictionary can contain names (with uppercase letters)
that also match "ordinary" words (all lowercase), but different folding
rules apply to each of them. On the other hand, words at the beginning of
a sentence typically start with an uppercase letter, so we need to lowercase
them before we can find a match in the dictionary. Is there any mechanism
implemented in your plugin to deal with these situations correctly?

Also, does your plugin correctly handle words that need folding on both
ends, prefix and suffix? For English an example would be "nonuniqueness"
(probably not the best example, but you get the idea, right?).

Regards,
Lukas



(Lukáš Vlček) #3

Update: I quickly looked at the dictionary files in your repo. It seems
these are plain translation tables and no affix rules are applied (were the
tables created by expanding the original affix rules?). Is that correct?

One more question: can your plugin output more than one token for one input
token?

Thanks,
Lukas



(Jörg Prante) #4

On Sat, Nov 16, 2013 at 8:41 AM, Lukáš Vlček lukas.vlcek@gmail.com wrote:

Hi Jörg,

nice work!

One question: I did not check the source code, but from the docs it seems
that you allow the use of dict files, for example for English. Dict files
generally allow case-sensitive words, which can be very useful. Is there
any recommendation on how to use your plugin in combination with the
lowercase filter and still get the full power of a case-sensitive dictionary?

Upper and lower case are distinct: the baseform analysis does no lowercase
conversion, it is case-sensitive.

For example, the dictionary can contain names (with uppercase letters)
that also match "ordinary" words (all lowercase), but different folding
rules apply to each of them. On the other hand, words at the beginning of
a sentence typically start with an uppercase letter, so we need to lowercase
them before we can find a match in the dictionary. Is there any mechanism
implemented in your plugin to deal with these situations correctly?

No, not really. There is no POS tagging (yet), so the baseform analysis
cannot detect the meaning of whole sentences. The dictionary can hold more
than one entry for a word, but only the first one is used, for simplicity.
This may not work correctly in all situations and will be improved in future
versions.
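
One way to approach the sentence-initial case discussed above is to try the token exactly as written and fall back to its lowercase form only when there is no exact match. The following is a minimal sketch under stated assumptions, not the plugin's code: the class `CaseAwareLookup` and its methods are hypothetical, a `HashMap` stands in for the FST dictionary, and the plugin itself performs no case conversion.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not the plugin's code: a HashMap stands in for the FST dictionary.
final class CaseAwareLookup {
    private final Map<String, String> baseforms = new HashMap<>();

    // Keep only the first entry per surface form, as described in the thread.
    void add(String form, String baseform) {
        baseforms.putIfAbsent(form, baseform);
    }

    // Exact (case-sensitive) match first, lowercase fallback second.
    String lookup(String token) {
        String exact = baseforms.get(token);
        if (exact != null) {
            return exact;
        }
        String folded = baseforms.get(token.toLowerCase(Locale.ROOT));
        // Unknown tokens are treated as already being in base form.
        return folded != null ? folded : token;
    }

    public static void main(String[] args) {
        CaseAwareLookup dict = new CaseAwareLookup();
        dict.add("Mills", "Mills"); // proper name, kept as is
        dict.add("mills", "mill");  // common noun
        dict.add("went", "go");
        System.out.println(dict.lookup("Mills"));
        System.out.println(dict.lookup("mills"));
        System.out.println(dict.lookup("Went")); // sentence-initial capital
    }
}
```

The fallback cannot tell a sentence-initial "Mills" (plural noun) from the name; resolving that would need context such as the POS tagging mentioned above.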

Also, does your plugin correctly handle words that need folding on both
ends, prefix and suffix? For English an example would be "nonuniqueness"
(probably not the best example, but you get the idea, right?).

nonuniqueness is a compound word. Decompounding is not an easy linguistic
task. The baseform dictionary has no entry for "nonuniqueness", so
it is assumed to be a word already in base form. Either an entry could be
added, or maybe one day the decompounder plugin can work on English words
and separate "nonuniqueness" into "non" and "uniqueness". After that, the
baseform plugin could reduce the word parts. This is a very common case in
German, but there are pitfalls because the meaning of words can change under
simple lexical decompounding, for example "Lehrkraft" (teacher) -> "Lehr"
(teach) + "kraft" (power). Therefore my decompounding plugin works with a
list of trained decompositions, not with a dictionary.
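
The pitfall can be made concrete with a small sketch. It contrasts a naive lexical split (cut wherever both halves are dictionary words) with a lookup in a list of known decompositions. Everything here is hypothetical and for illustration only: the class `DecompoundSketch`, both method names, and the use of a plain `Map` as a stand-in for "a list of trained decompositions". How the real decompounder plugin stores or applies its data is not shown in this thread.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: naive lexical splitting versus a curated list of decompositions.
final class DecompoundSketch {

    // Naive approach: split at the first position where both halves are known words.
    static List<String> naiveSplit(String word, Set<String> lexicon) {
        String w = word.toLowerCase(Locale.ROOT);
        for (int i = 1; i < w.length(); i++) {
            String head = w.substring(0, i);
            String tail = w.substring(i);
            if (lexicon.contains(head) && lexicon.contains(tail)) {
                return Arrays.asList(head, tail);
            }
        }
        return Collections.singletonList(w);
    }

    // List-based approach: split only words that have a listed decomposition.
    static List<String> listedSplit(String word, Map<String, List<String>> known) {
        List<String> parts = known.get(word);
        return parts != null ? parts : Collections.singletonList(word);
    }

    public static void main(String[] args) {
        Set<String> lexicon = new HashSet<>(Arrays.asList("lehr", "kraft"));
        Map<String, List<String>> known = new HashMap<>(); // "Lehrkraft" deliberately not listed
        System.out.println(naiveSplit("Lehrkraft", lexicon)); // splits, losing the meaning "teacher"
        System.out.println(listedSplit("Lehrkraft", known));  // left intact
    }
}
```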

Jörg



(Jörg Prante) #5

On Sat, Nov 16, 2013 at 8:58 AM, Lukáš Vlček lukas.vlcek@gmail.com wrote:

Update: I quickly looked at the dictionary files in your repo. It seems
these are plain translation tables and no affix rules are applied (were the
tables created by expanding the original affix rules?). Is that correct?

I dumped the english.dict dictionary of LanguageTool.org into plain text
and sorted the word list.

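The source of english.dict can be found at
https://github.com/languagetool-org/languagetool/tree/master/languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en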

From the english.info file, I understand that no affix rules were applied.

For the baseform analysis, I extracted nouns and verbs from the word list.
The dictionary contains both regular forms (where stemming algorithms work
well) and irregular forms ("went" -> "go").
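
The extraction step could look roughly like the sketch below. It rests on assumptions that are not confirmed in this thread: that the plain-text dump has one tab-separated `form`, `lemma`, `tag` triple per line, and that noun and verb tags start with `NN` and `VB`. The class `WordListFilter` and its method are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: keep only noun and verb entries from a dumped word list.
// Assumed line format (not confirmed by the thread): form<TAB>lemma<TAB>tag
final class WordListFilter {

    static Map<String, String> extract(List<String> lines) {
        // TreeMap keeps the resulting form -> baseform table sorted by form.
        Map<String, String> table = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length < 3) {
                continue; // skip malformed lines
            }
            String tag = fields[2];
            // Assumed tag prefixes for nouns and verbs.
            if (tag.startsWith("NN") || tag.startsWith("VB")) {
                table.putIfAbsent(fields[0], fields[1]);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        List<String> dump = Arrays.asList(
                "went\tgo\tVBD",
                "houses\thouse\tNNS",
                "quickly\tquickly\tRB");
        System.out.println(extract(dump));
    }
}
```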

One more question: can your plugin output more than one token for one
input token?

It could, by changing this if statement into a while statement when
reading out the FST result

https://github.com/jprante/elasticsearch-analysis-baseform/blob/master/src/main/java/org/xbib/elasticsearch/analysis/baseform/Dictionary.java#L62

and, more importantly, by adding an algorithm that decides which result
tokens are to be used.
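
The difference between "first entry only" and "all entries" can be sketched as follows. This is not the code behind the link above; the class `MultiLemmaLookup` is hypothetical and a `HashMap` of lists stands in for the FST. The selection algorithm is deliberately left out, since the thread leaves it open.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a dictionary that can return one or all baseforms per token.
final class MultiLemmaLookup {
    private final Map<String, List<String>> entries = new HashMap<>();

    void add(String form, String baseform) {
        entries.computeIfAbsent(form, k -> new ArrayList<>()).add(baseform);
    }

    // "if" behaviour: only the first entry is used.
    String first(String token) {
        List<String> found = entries.get(token);
        return found != null ? found.get(0) : token;
    }

    // "while" behaviour: every entry is read out; the caller must decide which to keep.
    List<String> all(String token) {
        List<String> found = entries.get(token);
        return found != null
                ? Collections.unmodifiableList(found)
                : Collections.singletonList(token);
    }

    public static void main(String[] args) {
        MultiLemmaLookup dict = new MultiLemmaLookup();
        dict.add("leaves", "leave"); // verb reading
        dict.add("leaves", "leaf");  // noun reading
        System.out.println(dict.first("leaves"));
        System.out.println(dict.all("leaves"));
    }
}
```

Emitting all candidates at the same token position raises recall at the cost of precision, which is why a selection step matters.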

Jörg



(Lukáš Vlček) #6

Hi Jörg,

I know very little about what LanguageTool does (their web example indeed
looks promising), but from my naive understanding, and based on a quick
look through their repo, it seems that most of the dictionaries are based
on some *spell dictionaries (MySpell, ASpell, iSpell, Hunspell), and as
such their original representation was in the form of dic and aff files
(dic = dictionary of base word forms plus codes of valid affix rules,
aff = prefix and suffix rules describing how the base word can be modified
to get other word forms). That said, what is really the advantage over the
existing Hunspell token filter?
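
To make the dic/aff idea concrete, here is a heavily simplified sketch of expanding base words plus affix flags into a flat form -> baseform table. It is not the Hunspell file format or algorithm: only unconditional suffixes are modelled, and the class `AffixExpansion`, the flag letters and the rule table are all made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, heavily simplified sketch of "expanding" affix rules into a flat table.
// Real Hunspell rules also have conditions, stripping and prefixes, none modelled here.
final class AffixExpansion {

    // dic: base word -> string of flag letters; suffixRules: flag letter -> suffix to append.
    static Map<String, String> expand(Map<String, String> dic, Map<Character, String> suffixRules) {
        Map<String, String> table = new TreeMap<>();
        for (Map.Entry<String, String> entry : dic.entrySet()) {
            String base = entry.getKey();
            table.put(base, base); // the base form maps to itself
            for (char flag : entry.getValue().toCharArray()) {
                String suffix = suffixRules.get(flag);
                if (suffix != null) {
                    table.put(base + suffix, base);
                }
            }
        }
        return table;
    }

    public static void main(String[] args) {
        Map<String, String> dic = new HashMap<>();
        dic.put("walk", "DGS");
        Map<Character, String> rules = new HashMap<>();
        rules.put('D', "ed");
        rules.put('G', "ing");
        rules.put('S', "s");
        System.out.println(expand(dic, rules));
    }
}
```

One dic entry with three flags becomes four table rows here, which is the growth the "expanded" representation has to absorb.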

Also, isn't the "expanded" dictionary representation always larger (at least
for highly inflected languages) than the two-file representation, dic and
aff?

Maybe what is different is that they combined various dictionary sources
into one database? For example, for German they merged AT, DE, CH [1], for
English CA, GB, NZ, US, ZA [2] … ? Just speculating. But if they did this,
is it really a valid operation from a language dictionary perspective?

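[1]
https://github.com/languagetool-org/languagetool/tree/master/languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell
[2]
https://github.com/languagetool-org/languagetool/tree/master/languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell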

Maybe I am just misled by the presence of the hunspell folder in their
resources and by the fact that I could not quickly find a clear
explanation of how they create the final dictionary and how it compares
to existing dictionaries.

Regards,
Lukas



(Jörg Prante) #7

On Sun, Nov 17, 2013 at 2:39 PM, Lukáš Vlček lukas.vlcek@gmail.com wrote:

Hi Jörg,

I know very little about what LanguageTool does (their web example indeed
looks promising), but from my naive understanding, and based on a quick
look through their repo, it seems that most of the dictionaries are based
on some *spell dictionaries (MySpell, ASpell, iSpell, Hunspell), and as
such their original representation was in the form of dic and aff files
(dic = dictionary of base word forms plus codes of valid affix rules,
aff = prefix and suffix rules describing how the base word can be modified
to get other word forms). That said, what is really the advantage over the
existing Hunspell token filter?

The Hunspell token filter is very slow; the morfologik FSA is faster by a
factor of about 10x-100x.

Also, isn't the "expanded" dictionary representation always larger (at least
for highly inflected languages) than the two-file representation, dic and
aff?

I never compared the representations; it differs a lot between languages.
The morfologik FSA is very compact, see
http://wiki.languagetool.org/hunspell-support

Maybe what is different is that they combined various dictionary sources
into one database? For example, for German they merged AT, DE, CH [1], for
English CA, GB, NZ, US, ZA [2] … ? Just speculating. But if they did this,
is it really a valid operation from a language dictionary perspective?

It depends. Merging dialects into one file is OK as long as word variants do
not overwrite each other. For spell checking there is a reason for the
country variants: some words appear in one variant but not in the other.
For baseforms, I think it does not matter, and a single baseform dictionary
per language is OK.
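
The "do not overwrite each other" condition can be checked mechanically when merging. The sketch below merges several form -> baseform tables and records every form whose baseform differs between variants. It is an illustration only; the class `DictionaryMerge` is hypothetical and says nothing about how LanguageTool builds its dictionaries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: merge per-country tables and report forms that would be overwritten.
final class DictionaryMerge {

    static Map<String, String> merge(List<Map<String, String>> variants, List<String> conflicts) {
        Map<String, String> merged = new TreeMap<>();
        for (Map<String, String> variant : variants) {
            for (Map.Entry<String, String> entry : variant.entrySet()) {
                // The first variant wins; a differing later entry is recorded as a conflict.
                String previous = merged.putIfAbsent(entry.getKey(), entry.getValue());
                if (previous != null && !previous.equals(entry.getValue())) {
                    conflicts.add(entry.getKey());
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> gb = new HashMap<>();
        gb.put("colours", "colour");
        gb.put("went", "go");
        Map<String, String> us = new HashMap<>();
        us.put("colors", "color");
        us.put("went", "go");
        List<String> conflicts = new ArrayList<>();
        System.out.println(merge(Arrays.asList(gb, us), conflicts));
        System.out.println("conflicts: " + conflicts);
    }
}
```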

[1]
https://github.com/languagetool-org/languagetool/tree/master/languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell
[2]
https://github.com/languagetool-org/languagetool/tree/master/languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell

Maybe I am just misled by the presence of the hunspell folder in their
resources and by the fact that I could not quickly find a clear
explanation of how they create the final dictionary and how it compares
to existing dictionaries.

It looks like they dumped the hunspell dicts and filled a morfologik FSA
with them.

Jörg



(Lukáš Vlček) #8

Thanks Jörg,

the reason I asked all those questions is that I see a lot of overlap
with Lucene Hunspell (overlap in terms of the functionality provided in
analysis, not in terms of internal implementation). So I am glad I now
understand better how these two things compare.

FSA might be faster; on the other hand, Lucene does not rely on hunspellJNA,
it has its own implementation of dictionary traversal and lookup. So maybe
it is not that bad (I would not consider the performance comparison stated
on the LanguageTool wiki page relevant in this case). And if it really is
that much slower, then it might be useful to improve the Lucene
implementation (I do not see a reason why the Lucene implementation could
not build an FSA internally as well).

Just my 2 cents,
Lukáš



(Jörg Prante) #9

Lucene uses the morfologik FSA for a Polish analyzer:
http://lucene.apache.org/core/4_5_1/analyzers-morfologik/index.html
There is no reason the morfologik analyzer could not be extended to other
languages.

The fact is that hunspell has its weaknesses (for example with Finnish,
which is the reason why hfst and voikko exist). Hunspell quality depends
heavily on the provided dictionaries. Also, if the dicts are too large, the
hunspell approach simply fails because it becomes too slow and memory usage
exceeds all reasonable sizes.

For example https://issues.apache.org/jira/browse/SOLR-3245

In terms of resource usage and performance, the morfologik FSA is a good
alternative.

Baseform reduction (lemmatization) is also a different task from spell
checking, the task hunspell dicts are made for; Lucene analysis using
hunspell is just a side effect.

Baseforms work only on correctly spelled words, whereas hunspell can also
recognize misspelled words.

Jörg


