I noticed some behavior that confused me quite a bit, and it took me a while to find the root cause. Now I'm looking for advice on whether this is a widely accepted "gotcha" or might even be considered a bug.
Basically I want to use Elasticsearch's "dictionary_decompounder" and "hyphenation_decompounder" token filters (Elasticsearch 2.3.3). Both take a word list as input. There are plenty of word files available, so getting the data is no problem.
Now I also treat searches case-insensitively ("lowercase" filter). However, I noticed that decompounding does not work UNLESS all items in the word list/file are lower-cased as well.
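To make this concrete, here is a minimal version of the kind of setup I mean (index, filter, and analyzer names as well as the word list entries are just examples):

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "german_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["fußball", "halle"]
        }
      },
      "analyzer": {
        "german_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "german_decompounder"]
        }
      }
    }
  }
}
```

With this analyzer, "Fußballhalle" is lowercased to "fußballhalle" before the decompounder runs, so it is split into "fußball" and "halle" as expected. If the word_list contained the dictionary spellings "Fußball" and "Halle" instead, the decompounder would silently find no subwords - that is the gotcha.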
Somehow I can understand that requirement (it is similar to using the same tokenization for indexing and searching), but it still took me quite some time to figure out. I think this should at least be mentioned in the docs (as a gotcha), though I'm not sure what phrasing would be best.
Ideally I would like Elasticsearch to lowercase my word lists itself so I could drop in unmodified files from external sources without preprocessing, but that seems to be a minor issue.
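Concretely, I'd love to be able to point the filter at a file taken straight from an external source (the path below is just an example) and have Elasticsearch normalize the entries itself:

```
"german_decompounder": {
  "type": "dictionary_decompounder",
  "word_list_path": "analysis/german-words.txt"
}
```

Right now this only works if I lowercase every line of german-words.txt in a separate preprocessing step first.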
Any suggestions on how I should proceed? As I have never contributed to Elasticsearch before, any advice on the bug reporting process / documentation enhancements would be appreciated.
Felix
(I suspect the problem hasn't affected many users so far, as German seems to be the only language with both compound words AND capitalized nouns - and unfortunately the majority of German compound words are made of nouns, so capitalization is a pretty big deal here.)