Hi Jörg,
I know very little about what LanguageTool does (their web example does
look promising), but from my naive understanding, and based on a quick look
through their repo, it seems that most of the dictionaries are based on
*spell dictionaries (MySpell, ASpell, iSpell, Hunspell), and as such their
original representation was in the form of dic and aff files (dic =
dictionary of base word forms plus codes of valid affix rules, aff =
prefix/suffix rules describing how the base word can be modified to get
other word forms). That said, what is really the advantage over the
existing Hunspell token filter?
Also, isn't the "expanded" dictionary representation always larger (at
least for highly inflected languages) than the representation as two
files, dic and aff?
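To illustrate why: a toy Hunspell pair (hypothetical entries) with foo.dic
(the first line is the entry count)

    2
    cat/S
    dog/S

and foo.aff (flag S: strip nothing, append "s", under any condition)

    SFX S Y 1
    SFX S 0 s .

expands to four surface forms: cat, cats, dog, dogs. In a highly inflected
language a single base entry can carry several flags, each producing many
forms, so the expanded word list grows much faster than the dic/aff pair.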
Maybe what is different is that they combined various dictionary sources
into one database? E.g. for German they merged AT, DE, CH [1], for English
CA, GB, NZ, US, ZA [2] … ? Just speculating. But if they did that, is it
really a valid operation from a language-dictionary perspective?
[1]
[2]
Maybe I am just misled by the presence of the hunspell folder in their
resources and by the fact that I was not able to quickly find a clear
explanation of how they create the final dictionary and how it compares
to existing dictionaries.
Regards,
Lukas
On Sat, Nov 16, 2013 at 11:44 AM, joergprante@gmail.com <joergprante@gmail.com> wrote:

On Sat, Nov 16, 2013 at 8:58 AM, Lukáš Vlček <lukas.vlcek@gmail.com> wrote:
Update: I quickly looked at the dictionary files in your repo. It seems
these are plain translation tables and no affix rules are applied (were
the tables created by expanding the original affix rules?), is that
correct?
I dumped the english.dict dictionary of LanguageTool.org into plain text
and sorted the word list.
The source of english.dict can be found at
https://github.com/languagetool-org/languagetool/tree/master/languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en
From the english.info file, I understand that no affix rules were applied.
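Roughly, such a dump can be done with the morfologik-stemming library
along these lines (a sketch, not necessarily the exact steps; it assumes
the english.info metadata file sits next to english.dict, and that
iterating a DictionaryLookup walks all entries):

    import morfologik.stemming.Dictionary;
    import morfologik.stemming.DictionaryLookup;
    import morfologik.stemming.WordData;

    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class DumpDict {
        public static void main(String[] args) throws Exception {
            // Dictionary.read() loads the FSA plus its .info metadata
            Dictionary dict = Dictionary.read(Paths.get("english.dict"));
            List<String> lines = new ArrayList<>();
            // DictionaryLookup is Iterable over all entries in the FSA
            for (WordData wd : new DictionaryLookup(dict)) {
                lines.add(wd.getWord() + "\t" + wd.getStem() + "\t" + wd.getTag());
            }
            Collections.sort(lines);
            for (String line : lines) {
                System.out.println(line);
            }
        }
    }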
For the baseform analysis, I extracted nouns and verbs from the word list.
The dictionary contains both regular forms (where stemming algorithms work
well) and irregular forms ("went" -> "go").
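The irregular forms are exactly what dictionary-based lookup adds over
algorithmic stemming: the mappings are simply stored and looked up. A
minimal sketch of that idea (hypothetical names, not the plugin's actual
code):

    import java.util.HashMap;
    import java.util.Map;

    public class BaseformLookup {

        // a handful of stored irregular mappings; a real dictionary
        // holds many thousands of entries
        private static final Map<String, String> IRREGULAR = new HashMap<>();
        static {
            IRREGULAR.put("went", "go");
            IRREGULAR.put("mice", "mouse");
            IRREGULAR.put("better", "good");
        }

        public static String baseform(String token) {
            // exact lookup first: irregular forms cannot be derived by rules
            String stored = IRREGULAR.get(token);
            if (stored != null) {
                return stored;
            }
            // trivial fallback for regular forms; a real stemmer would go here
            if (token.endsWith("s") && token.length() > 3) {
                return token.substring(0, token.length() - 1);
            }
            return token;
        }
    }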
One more question: can your plugin output more than one token for one
input token?
It could, by changing this if statement into a while statement when
reading out the FST result:
https://github.com/jprante/elasticsearch-analysis-baseform/blob/master/src/main/java/org/xbib/elasticsearch/analysis/baseform/Dictionary.java#L62
and, more importantly, by adding an algorithm that decides which of the
result tokens should be used.
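For illustration, the usual Lucene pattern for emitting several tokens at
one position is to buffer the extra terms and replay them with a position
increment of zero. A rough sketch (hypothetical, not the plugin's actual
code; lookup() stands in for the FST access):

    import java.io.IOException;
    import java.util.ArrayDeque;
    import java.util.Collections;
    import java.util.Deque;
    import java.util.List;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

    public final class MultiBaseformFilter extends TokenFilter {

        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
        private final Deque<String> pending = new ArrayDeque<>();
        private State savedState;

        public MultiBaseformFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            // emit buffered baseforms first, stacked at the same position
            if (!pending.isEmpty()) {
                restoreState(savedState);
                termAtt.setEmpty().append(pending.pop());
                posIncAtt.setPositionIncrement(0);
                return true;
            }
            if (!input.incrementToken()) {
                return false;
            }
            // lookup() is a placeholder for the FST lookup; assumed to
            // return all baseform candidates for the surface form
            List<String> baseforms = lookup(termAtt.toString());
            if (baseforms.size() > 1) {
                pending.addAll(baseforms.subList(1, baseforms.size()));
                savedState = captureState();
            }
            if (!baseforms.isEmpty()) {
                termAtt.setEmpty().append(baseforms.get(0));
            }
            return true;
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            pending.clear();
            savedState = null;
        }

        private List<String> lookup(String term) {
            // placeholder for the real FST-backed dictionary lookup
            return Collections.singletonList(term);
        }
    }

The harder part is, as said, the decision logic: which of the candidates
to keep, and whether to stack them all or pick one.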
Jörg