This produces the same result, so please try to keep examples as simple as possible; that helps you get a faster response.
Have a look at this:
POST telephone_book/_analyze
{
  "analyzer": "german",
  "text": ["Muller"]
}

POST telephone_book/_analyze
{
  "analyzer": "german",
  "text": ["Müller"]
}

POST telephone_book/_analyze
{
  "analyzer": "german",
  "text": ["Mueller"]
}

POST telephone_book/_analyze
{
  "analyzer": "german",
  "text": ["Muell"]
}
If you run this, you will see exactly which tokens are generated at index time and at search time. That will help you understand why only "Mueller" matches when you search for "Muell".
Using "explain": true will give you all the transformations that are happening when using an analyzer:
POST telephone_book/_analyze
{
  "explain": true,
  "analyzer": "german",
  "text": ["Muller"]
}
Thanks for the tip with the _analyze endpoint, that is really interesting.
Sorry for my "complicated" example, but I just wanted to make clear what my setup looks like (with a nested path), and I didn't know whether that might be the problem.
So, back to my problem: I have searched a lot about this topic, and after a week I still can't reach a conclusion on how to set up my index so that searching for "ue" will match an "ü". Some posts talk about a Snowball plugin, I have tried all the different stemmers (german, german2, light_german, minimal_german), and other posts are just so old that I wanted confirmation that this is still the right way to do it.
If the answer is "you just CAN'T get 'ü' results when searching for 'ue'", then I (and especially my customers) will have to accept that. But I wanted to see if there is somehow a way to achieve this.
Yeah, it is a bug that popped up at my workplace. Another coworker said that he rarely sees a working umlaut search in other address books like Outlook etc.
So as I said, if nothing works (I will try out the synonym token filter; see the sketch below), a "no" is also an answer I will accept^^
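For reference, a synonym token filter along those lines might look like the following sketch. The index, filter, and analyzer names are placeholders, and note that a synonym list only covers the terms you list explicitly, so it doesn't generalize to arbitrary names:

PUT telephone_book_synonyms
{
  "settings": {
    "analysis": {
      "filter": {
        "umlaut_synonyms": {
          "type": "synonym",
          "synonyms": [
            "müller, mueller, muller"
          ]
        }
      },
      "analyzer": {
        "name_synonyms": {
          "tokenizer": "standard",
          "filter": ["lowercase", "umlaut_synonyms"]
        }
      }
    }
  }
}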
These are the rules the german_normalization token filter applies:
'ä', 'ö', 'ü' are replaced by 'a', 'o', 'u', respectively.
'ae' and 'oe' are replaced by 'a' and 'o', respectively.
'ue' is replaced by 'u' when not following a vowel or q.
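You can see these rules in action with an ad-hoc _analyze request (the filter chain here is just an illustration, not your index config):

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "german_normalization"],
  "text": ["Müller Mueller Quelle"]
}

Per the rules above, this should return the tokens muller, muller and quelle: the 'ue' in "Quelle" stays untouched because it follows a q.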
Let's try to put this together the right way. You want to tokenize, normalize (optionally stem, but that doesn't make a difference for this example, so I've left it out), and only build the ngrams at the very end. You basically built the ngrams too early in the chain, so the normalization didn't work on them.
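A minimal sketch of such a setup, assuming the standard tokenizer, lowercasing, german_normalization, and an ngram token filter as the last step (the index name, the custom filter and analyzer names, the field name, and the gram sizes are placeholders you'd adapt):

PUT telephone_book
{
  "settings": {
    "analysis": {
      "filter": {
        "my_ngrams": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4
        }
      },
      "analyzer": {
        "german_ngram": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "german_normalization",
            "my_ngrams"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "german_ngram"
      }
    }
  }
}

With this chain, "Müller" and "Mueller" both normalize to muller before the ngrams are built, so their ngrams line up.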
This should work, and you can also check the ngrams with the _analyze endpoint.
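For example, with the field parameter the request runs through whatever analyzer the mapping assigns to that field (the field name name is an assumption):

POST telephone_book/_analyze
{
  "field": "name",
  "text": ["Müller"]
}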
If you need to customize those rules (for example, if 'ue' should always be replaced by 'u'), you'll need to write your own char filter, probably a mapping char filter, as sketched below.
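A hedged sketch of what that could look like (the index, char filter, and analyzer names are placeholders; char filters run before the tokenizer and before the lowercase filter, hence the uppercase variants):

PUT telephone_book_v2
{
  "settings": {
    "analysis": {
      "char_filter": {
        "umlaut_char_filter": {
          "type": "mapping",
          "mappings": [
            "ü => u",
            "Ü => U",
            "ue => u",
            "Ue => U"
          ]
        }
      },
      "analyzer": {
        "german_custom": {
          "tokenizer": "standard",
          "char_filter": ["umlaut_char_filter"],
          "filter": ["lowercase"]
        }
      }
    }
  }
}

Keep in mind that a blanket "ue => u" also rewrites words where the 'ue' follows a vowel or a q ("Quelle" becomes "Qulle"), which is exactly the case the built-in rule excludes.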
Thanks, that works like a charm! Using an ngram filter instead of an ngram tokenizer, wow!
Thanks to "_analyze" and "explain": true I now understand much better what is happening, and I also found out that if I use a stemmer, I should add it AFTER the ngram filter, otherwise my "ends with" search won't work.
I'm not sure stemming and ngrams make much sense together. I'd analyze a field multiple times and then search over all of them to get the best results, along the lines of the sketch below.
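A sketch of that multi-field idea, reusing the german_ngram analyzer from the earlier example (the field name and sub-field name are assumptions):

PUT telephone_book
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "german",
        "fields": {
          "ngram": {
            "type": "text",
            "analyzer": "german_ngram"
          }
        }
      }
    }
  }
}

GET telephone_book/_search
{
  "query": {
    "multi_match": {
      "query": "Muell",
      "fields": ["name", "name.ngram"]
    }
  }
}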
PS: Something that you might want to include for German is a decompounder (see the sketch below), unless this is already sufficiently covered by the ngrams.
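As a hedged illustration, a dictionary-based decompounder filter might be configured like this; the index and filter names are placeholders, and the word list is a toy example, in practice you'd need a proper German dictionary:

PUT telephone_book_decompound
{
  "settings": {
    "analysis": {
      "filter": {
        "german_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["telefon", "buch"]
        }
      },
      "analyzer": {
        "german_compound": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "german_decompounder"
          ]
        }
      }
    }
  }
}

With that, a token like telefonbuch would additionally emit the subwords telefon and buch.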