Stemming keyword fields before indexing

I have a field, "Title", that I use for aggregations, including building a Tag Cloud in Kibana, so the field is declared as a keyword.
I write the title text into this field as an array, with each word as a separate element.

There is a problem when building the tag cloud: different forms of the same word are displayed as separate elements, for example fox and foxes.
Stemming before indexing would help here, so that the stemmed result is what gets stored in the index.

Is it possible to use the Elastic stemmer before indexing a field? I would like to store only the stems of words in the Title field. I looked at the keyword analyzer, normalizers (stemmers are not supported), and the significant text aggregation (I think it could work, but I am not sure), and looked for other options, but in the end I got confused.

Or should I implement stemming preprocessing myself as a third-party solution outside of Elastic?

There's a Kibana issue open to add support for significant_text. Feel free to upvote.

In the meantime, the analyze API may be of some use for turning your text into arrays of keywords.
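As a rough sketch of what that could look like from a client, assuming a standard tokenizer followed by the lowercase and stemmer token filters (the same combination tried later in this thread). The function names and `base_url` are hypothetical, and the sketch uses only the Python standard library rather than an official client:

```python
import json
import urllib.request


def build_analyze_request(text):
    """Body for POST /_analyze: standard tokenizer, then lowercase and stemmer filters."""
    return {
        "tokenizer": "standard",
        "filter": ["lowercase", "stemmer"],
        "text": text,
    }


def tokens_from_response(response):
    """Pull the token strings out of an _analyze response, in their original order."""
    return [entry["token"] for entry in response.get("tokens", [])]


def analyze_title(base_url, text):
    """Send a title to the _analyze endpoint and return the resulting tokens.

    base_url is a hypothetical value such as "http://localhost:9200".
    """
    request = urllib.request.Request(
        f"{base_url}/_analyze",
        data=json.dumps(build_analyze_request(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_from_response(json.load(reply))
```

Nothing is stored by this call; it only returns the tokens, which the client then has to index itself.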


For a test, I used an analyzer (/index/_analyze) consisting of lowercase and stemmer; when I sent a test message, the response contained the correct result. After that I tried to assign this analyzer to the keyword field (according to the documentation), but the original string was still stored in the index, without stemming.
Do I understand correctly that the analyzer is simply applied to the message, but the original field values are still put into the index? Or can I write the analyzer's output back to the original field?

Yes, then I tested the search, and since this is a keyword field, it only looked for an exact match.
Perhaps I did something wrong or I don't understand something.
All I would like to do is apply a stemmer before indexing so that its output is stored in the index in a given field.


Keyword fields don't use analyzers. They can only use normalizers, which offer a subset of the usual text-processing logic, e.g. lower-casing but not stemming, synonyms, tokenising etc.
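To make the distinction concrete, here is a minimal sketch of an index body with a keyword field that uses a lowercase normalizer. The name `lowercase_normalizer` is made up for illustration; the dict is what you would send as the JSON body when creating the index:

```python
import json

# Index body: a custom normalizer that only lower-cases, attached to a keyword field.
# "lowercase_normalizer" is a hypothetical name chosen for this example.
index_body = {
    "settings": {
        "analysis": {
            "normalizer": {
                "lowercase_normalizer": {
                    "type": "custom",
                    "filter": ["lowercase"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "Title": {
                "type": "keyword",
                "normalizer": "lowercase_normalizer",
            }
        }
    },
}

if __name__ == "__main__":
    # Print the JSON that would go in the body of the create-index request.
    print(json.dumps(index_body, indent=2))
```

The normalizer works on each keyword value as a whole and never splits it into tokens, so it cannot stand in for a stemming analyzer.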

My suggestion was you take these "correct results" and assemble them into string arrays for your title field which is mapped as a keyword. It's kind of perverse because you're doing in your client exactly what the server tries to prevent in not tokenising keyword strings into multiple values. In this case we justify it because the titles are hopefully short and it's not a lengthy article.
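A minimal sketch of that client-side step, assuming you already have the list of stems for a title. The helper names, `base_url` and `index` are hypothetical, and dropping repeated stems is my own choice so that each stem is listed once per title:

```python
import json
import urllib.request


def title_document(stems):
    """Build the document body: unique stems, in first-seen order, as a string array."""
    unique = list(dict.fromkeys(stems))
    return {"Title": unique}


def index_title(base_url, index, stems):
    """Index one document whose Title keyword field holds the stems.

    base_url (e.g. "http://localhost:9200") and index are hypothetical values.
    """
    request = urllib.request.Request(
        f"{base_url}/{index}/_doc",
        data=json.dumps(title_document(stems)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

For example, `title_document(["fox", "jump", "fox"])` gives `{"Title": ["fox", "jump"]}`, which is the array shape the keyword field and the tag cloud aggregation expect.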

Ideally Kibana would support the significant_text aggregation and you wouldn't have to do all this.
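Outside Kibana you can already try the aggregation in a plain search request. A rough sketch of the request body follows, assuming the titles are also indexed into a text field (here called `title_text`, a made-up name) since significant_text works on text fields; the function name, the `sample_size` default and the sampler wrapper are my own choices:

```python
def significant_text_query(term, text_field="title_text", sample_size=200):
    """Search body: match a term, then find significant words in a sample of the hits.

    text_field is a hypothetical text-typed field holding the full title.
    """
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "query": {"match": {text_field: term}},
        "aggs": {
            "sample": {
                # Limit the analysis to the top-matching documents per shard.
                "sampler": {"shard_size": sample_size},
                "aggs": {
                    "keywords": {"significant_text": {"field": text_field}}
                },
            }
        },
    }
```

Calling `significant_text_query("covid")` returns a dict that can be sent as the JSON body of a `_search` request against the index.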


Yes, but I used this documentation and redeclared the analyzer. The request succeeded (when I tried to declare the stemmer in the normalizer, the request failed, which is understandable), so maybe it just doesn't work.

Where should I look to automate this process so that it becomes a single request? Or should I call the analyze command myself and then index the record with an update?
Yes, these are short titles. I understand that I am reinventing the wheel, but at this point it is a matter of interest.
In fact, my task is to find trends among the headlines (I go by individual words, for example "covid"). What I have now works, but I don't like the repetition of different word forms.
As a result, I am now tied to the keyword type because of the aggregation for the Tag Cloud (or its equivalent in a plain query).

Yes, thanks, this should give the correct result (it does in my tests; the question is how it will behave on real data). I read the description and tried to use it. It sounds good, but I did not really understand what happens inside. So the questions above are more a matter of self-education.
