Simple example: searching for "sprachkurs", which is tokenized as "sprachkurs", "sprach", "kurs".
The desired result is to show ONLY results that contain ALL tokens. Right now the results contain "sprach" OR "kurs".
Searching and indexing use the same analysis chain (lowercase, German normalization, and dictionary decompounder).
Sorry for the delayed reply, I was on vacation.
We solved this by using an additional analyzer for the search (search_analyzer). It is important that this analyzer does not use a decompounder. The background is that when indexing with the decompounder (indexing_analyzer), the resulting tokens are stored in the inverted index. In the case of "Sprachkurs", for example, that should be Sprachkurs, sprach, kurs, depending on the dictionary used.
If you now search for a "Sprachkurs", you will find all documents containing a "Sprachkurs". If you only search for "sprach" or "kurs", documents containing "Sprachkurs" are also included.
Additional token filters such as synonym-graph, ngram, etc. can improve the search result even further.
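To make the setup above concrete, here is a minimal sketch of index settings with two analyzers, written as a Python dict that could be sent as the body of a create-index request. The field name "title", the filter name "german_decompounder", and the tiny word list are assumptions for illustration only; a real dictionary would be much larger.

```python
# Sketch only: index-time analyzer decompounds, search-time analyzer does not.
# "title", "german_decompounder" and the word list are hypothetical.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "german_decompounder": {
                    "type": "dictionary_decompounder",
                    "word_list": ["sprach", "kurs", "kabel", "kanal"],
                }
            },
            "analyzer": {
                # Used when documents are indexed: emits the compound and its parts.
                "indexing_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "german_normalization", "german_decompounder"],
                },
                # Used for the query text: same chain, but no decompounder.
                "search_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "german_normalization"],
                },
            },
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "indexing_analyzer",
                # The mapping parameter "search_analyzer" points at the analyzer defined above.
                "search_analyzer": "search_analyzer",
            }
        }
    },
}
```

With a mapping like this, a query for "sprachkurs" is looked up as a single term, while a query for "sprach" or "kurs" still finds documents that contain "Sprachkurs".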
I tried to implement a different, basic analyzer (just lowercase + German normalization) for search, without the decompounder, but it is not working as expected.
If I search for "sprachkurs", I would expect it to show results that include "sprach" and "kurs", not just the exact match "sprachkurs". As it stands, I get a different number of results when searching for "sprachkurs" and for "sprach kurs".
The same goes for "kabel kanal" and "Kabelkanal": I would like the results to be the same.
For anyone wondering how I implemented it: I ended up doing an analyze request first to get the tokens.
So:
Get the user query, analyze it, and get the tokens from the decompounder (lowercase, German normalization, dictionary/hyphenation decompounder).
Use BOTH the user query and the tokens to perform the ES query.
Do a broad MUST match on the tokens with the AND operator, so that all tokens must exist in each result.
BOOST an exact phrase match of the actual user query on fields that do NOT use the decompounder.
BOOST the tokens only when they appear in exact order. So a user search for "kabelcanal" results in the tokens "kabel" and "canal", and we boost the exact phrase match "kabel canal".
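The steps above can be sketched roughly as follows. This is not the exact code used, just one possible shape under stated assumptions: the field names "title" and "title.exact" and the boost values are hypothetical, and the analyzed field is assumed not to decompound the query text a second time at search time. The helper only reads the "tokens" list that an _analyze response returns; no request is made here.

```python
from typing import Any


def tokens_from_analyze(response: dict[str, Any]) -> list[str]:
    """Pull token strings out of an _analyze response, dropping duplicates but keeping order."""
    seen: set[str] = set()
    tokens: list[str] = []
    for entry in response.get("tokens", []):
        token = entry["token"]
        if token not in seen:
            seen.add(token)
            tokens.append(token)
    return tokens


def build_query(
    user_query: str,
    tokens: list[str],
    analyzed_field: str = "title",     # hypothetical field indexed with the decompounder
    exact_field: str = "title.exact",  # hypothetical field indexed without it
) -> dict[str, Any]:
    """Build a bool query: all tokens are required, phrase matches are boosted."""
    token_text = " ".join(tokens)
    return {
        "query": {
            "bool": {
                # Broad match: every token must be present (AND operator).
                "must": [
                    {"match": {analyzed_field: {"query": token_text, "operator": "and"}}}
                ],
                "should": [
                    # Boost the exact phrase of what the user actually typed.
                    {"match_phrase": {exact_field: {"query": user_query, "boost": 3.0}}},
                    # Boost the tokens appearing in the same order, e.g. "kabel canal".
                    {"match_phrase": {analyzed_field: {"query": token_text, "boost": 2.0}}},
                ],
            }
        }
    }
```

For the user query "kabelcanal" with the tokens "kabel" and "canal", the required clause matches "kabel canal" with the AND operator, and the two optional clauses boost "kabelcanal" on the exact field and the phrase "kabel canal" on the analyzed field. Whether the original compound token is kept in the token list or filtered out before building the query is a design choice that depends on the decompounder settings.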