Search Match for all tokens from decompound filter

florin_olah · January 4, 2023, 3:34pm

Hello,

I am trying to do the same thing described in the topic here: German compound words in an e-commerce search

Simple example: searching for "sprachkurs", which is tokenized as "sprachkurs, sprach, kurs".
The desired result is to show ONLY results that contain ALL tokens. Right now the results contain "sprach" OR "kurs".

Searching and indexing use the same analyzer (lowercase, German normalization, and dictionary decompounder filters).

thaarbach · January 11, 2023, 5:23pm

Hi Florin,

Sorry for the delayed reply, I was on vacation.
We solved this by using an additional analyzer for the search (search_analyzer). It is important that this analyzer does not use a decompounder. The background is that when indexing with the decompounder (the index-time analyzer), the subword tokens are stored in the inverted index. In the case of "Sprachkurs", for example, that should be Sprachkurs, sprach, kurs, depending on the dictionary used.
If you now search for "Sprachkurs", you will find all documents containing "Sprachkurs". If you search only for "sprach" or "kurs", documents containing "Sprachkurs" are also included.
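
A minimal sketch of what such a setup could look like, written as a Python dict that can be sent as the body of an index-creation request. The analyzer and filter names ("index_decompound", "search_plain", "german_decompound"), the field name "title", and the tiny word list are assumptions for illustration only; they are not taken from the setup described above.

```python
import json

# Hypothetical analysis settings: the index-time analyzer decompounds,
# the search-time analyzer does not.
settings = {
    "analysis": {
        "filter": {
            "german_decompound": {
                "type": "dictionary_decompounder",
                # Illustrative word list; a real dictionary would be much larger.
                "word_list": ["sprach", "kurs", "kabel", "kanal"],
            }
        },
        "analyzer": {
            "index_decompound": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "german_normalization", "german_decompound"],
            },
            "search_plain": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "german_normalization"],
            },
        }
    }
}

# The field is indexed with the decompounder but searched without it.
mappings = {
    "properties": {
        "title": {
            "type": "text",
            "analyzer": "index_decompound",
            "search_analyzer": "search_plain",
        }
    }
}

index_body = {"settings": settings, "mappings": mappings}
print(json.dumps(index_body, indent=2))
```

With a mapping along these lines, a query for "Sprachkurs" is looked up as a single term, while the index still holds the subwords for queries such as "kurs".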

Additional token filters such as synonym-graph, ngram, etc. can improve the search result even further.

Regards Thomas

florin_olah · January 17, 2023, 12:36pm

Thank you for the reply!

Maybe I am doing something wrong.

I tried to implement a different, basic analyzer (just lowercase + German normalization) for search, without the decompounder, but it is not working as expected.

If I search for "sprachkurs", I would expect it to show results that include "sprach" and "kurs", not just the exact match "sprachkurs". As it is, I get a different number of results when searching for "sprachkurs" and "sprach kurs".

The same goes for "kabel kanal" and "Kabelkanal": I would like the results to be the same.

florin_olah · February 1, 2023, 10:03am

For anyone wondering how I implemented it: I ended up doing an analyze request first to get the tokens.

So:

1. Get the user query, analyze it, and take the tokens from the decompounder (lowercase, German normalization, dictionary/hyphenation decompounder).
2. Use BOTH the user query and the tokens to perform the ES query:

- Do a broad MUST match on the tokens with the AND operator, so all tokens must exist in each result.
- BOOST an exact phrase match of the actual user query on fields that do NOT use the decompounding analyzer.
- BOOST the tokens when they match in exact order. For example, a user search for "kabelcanal" results in the tokens "kabel" and "canal", and we boost the exact phrase match "kabel canal".
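
The steps above can be sketched roughly as follows. This is only an illustration under several assumptions: the field names "title" (analyzed with the decompounder) and "title.exact" (analyzed without it), the boost values, and the choice to run the token phrase boost against the decompounded field are all hypothetical. The token list is assumed to be the subwords returned by the analyze request; how the original compound token is handled is not stated in the post.

```python
import json


def build_query(user_query, tokens, analyzed_field="title", exact_field="title.exact"):
    """Build a bool query from the raw user query and its decompounded tokens."""
    token_text = " ".join(tokens)
    return {
        "query": {
            "bool": {
                # Broad match: every token must be present (AND operator).
                "must": [
                    {"match": {analyzed_field: {"query": token_text, "operator": "and"}}}
                ],
                "should": [
                    # Boost an exact phrase match of what the user typed,
                    # on a field that is not decompounded.
                    {"match_phrase": {exact_field: {"query": user_query, "boost": 3.0}}},
                    # Boost documents where the tokens appear in the same order.
                    {"match_phrase": {analyzed_field: {"query": token_text, "boost": 2.0}}},
                ],
            }
        }
    }


# Example: tokens as they might come back from the analyze request.
query = build_query("kabelcanal", ["kabel", "canal"])
print(json.dumps(query, indent=2))
```

The "must" clause decides which documents are returned; the two "should" clauses only affect ranking.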

system · March 1, 2023, 10:03am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
