Force token filter to output just one token

Hi,

I am working on the development of a search solution based on Elasticsearch for the Polish language. Initially, I used the recommended analysis-stempel plugin to stem Polish words, but after some experiments I found it was not ideal for quite a few words that are important for my search case.

Then I found another plugin created specifically for the Polish language: allegro/elasticsearch-analysis-morfologik on GitHub (Morfologik Polish Lemmatizer plugin for Elasticsearch). Based on some tests, it yields better results for my use case. However, its token filter often emits more than one output token per input token (word). This is because it tries to output the word in both its basic masculine and feminine forms. For example, the word "czerwoną" gets lemmatized into both "czerwona" and "czerwony".

I don't want that behaviour because it adds redundant tokens and negatively impacts the performance of my search queries.

Is there any way to limit the number of tokens output per input token? Taking just the first output token would fit all my needs.
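
To make the desired behaviour concrete, here is a minimal Python sketch of the logic I have in mind. It is not Elasticsearch code, and the names `Token` and `keep_first_per_position` are hypothetical. It also assumes that the extra forms are emitted at the same position as the first one, which I have not verified for this plugin.

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    """Hypothetical stand-in for an analyzer token: a term plus its position."""
    term: str
    position: int


def keep_first_per_position(tokens: Iterable[Token]) -> Iterator[Token]:
    """Yield only the first token seen at each position.

    Assumption: alternative forms of one input word share a position and
    arrive consecutively, so comparing with the previous position is enough.
    """
    last_position = None
    for token in tokens:
        if token.position != last_position:
            yield token
            last_position = token.position


# Simulated filter output for the phrase "czerwoną sukienkę" (illustrative only).
stream = [
    Token("czerwona", 0),
    Token("czerwony", 0),  # second form at the same position, to be dropped
    Token("sukienka", 1),
]
filtered = list(keep_first_per_position(stream))
```

With this input, `filtered` would keep "czerwona" and "sukienka" and drop "czerwony". What I am looking for is a way to get the same effect inside the Elasticsearch analysis chain.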

Thanks a lot for your help!
