I noticed some behavior that confused me quite a bit, and it took me a while to find the root cause. Now I'm looking for advice on whether this is a widely accepted "gotcha" or might even be considered a bug.
Basically I want to use Elasticsearch's "dictionary_decompounder" and "hyphenation_decompounder" token filters (Elasticsearch 2.3.3). Both take a word list as input. There are plenty of word files available, so getting the data is no problem.
Now I also treat searches case-insensitively ("lowercase" filter). However, I noticed that decompounding does not work UNLESS all items in the word list/file are lower-cased as well.
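To make this concrete, here is a minimal version of the kind of setup I mean (index, filter, and analyzer names as well as the word list entries are just examples):

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "german_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["fußball", "halle"]
        }
      },
      "analyzer": {
        "german_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "german_decompounder"]
        }
      }
    }
  }
}
```

With this analyzer, "Fußballhalle" is lowercased to "fußballhalle" before the decompounder runs, so it is split into "fußball" and "halle" as expected. If the word_list contained the dictionary spellings "Fußball" and "Halle" instead, the decompounder would silently find no subwords - that is the gotcha.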
Somehow I can understand that requirement (it is similar to using the same tokenization for indexing and searching), but it still took me quite some time to figure out. I think this should at least be mentioned in the docs (as a gotcha), though I'm not sure what phrasing would be best.
Ideally I would like Elasticsearch to lowercase my word lists itself so I could drop in unmodified files from external sources without preprocessing, but that seems to be a minor issue.
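Concretely, I'd love to be able to point the filter at a file taken straight from an external source (the path below is just an example) and have Elasticsearch normalize the entries itself:

```
"german_decompounder": {
  "type": "dictionary_decompounder",
  "word_list_path": "analysis/german-words.txt"
}
```

Right now this only works if I lowercase every line of german-words.txt in a separate preprocessing step first.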
Any suggestions on how I should proceed? As I have never contributed to Elasticsearch before, any advice on the bug reporting process / documentation enhancements would be appreciated.
Felix
(I suspect the problem hasn't affected many users so far, as German seems to be the only language with both compound words AND capitalized nouns - and unfortunately the majority of German compound words are made of nouns, so capitalization is a pretty big deal here.)