Restrict the number of unique terms indexed for a document

Hi, I have a requirement to restrict the number of unique terms indexed for any document to 1000. Any unique terms beyond the first 1000 in a particular document should be ignored and not indexed. Is there a setting in the index template, or some other setting, that covers this? This is a legacy requirement.
I searched for such a setting in Elasticsearch but could not find one.
Any help is appreciated.

Does ignore_above (Elasticsearch Guide [8.7]) help?

ignore_above

Strings longer than the ignore_above setting will not be indexed or stored. For arrays of strings, ignore_above will be applied for each array element separately and string elements longer than ignore_above will not be indexed or stored.

ignore_above deals with the length of individual terms.
I am asking about the number of unique terms in the content. Let's say a story contains 2000 unique terms. I want only the first 1000 unique terms to be indexed and the remaining 1000 to be ignored.

Ah OK, then you will need to do that yourself. Elasticsearch only tracks unique terms in its inverted index; unfortunately it doesn't expose that anywhere else for you to use during analysis.

I do not think there is anything built in that will do that. You would probably need to create your own analysis plugin.

Limiting the number of unique terms may result in missing out on useful terms. Why would you want to potentially reduce the quality of search like this?

"filter": {
  "max_token_count": {
    "type": "limit",
    "max_token_count": 1000
  }
}
The limit filter above can restrict the number of tokens to 1000, but it counts every token, repeated or not, so on its own it cannot cap the number of unique terms.
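One possibility nobody in this thread has tested: Elasticsearch also ships a built-in `unique` token filter that drops duplicate tokens, so placing it before the `limit` filter might give "first 1000 unique terms". The sketch below only builds the index settings as a Python dict. The analyzer name `first_unique_terms`, the choice of the `standard` tokenizer and the `lowercase` filter are assumptions for illustration. Note that `unique` collapses repeated terms, so term frequencies, and therefore scoring, would change.

```python
import json


def build_analysis_settings(max_unique_terms=1000):
    """Build index analysis settings that chain `unique` before `limit`.

    Assumption: the analyzer name, tokenizer and lowercase filter are
    illustrative choices, not taken from the original thread.
    """
    return {
        "analysis": {
            "filter": {
                # Same limit filter as in the post above.
                "max_token_count": {
                    "type": "limit",
                    "max_token_count": max_unique_terms,
                }
            },
            "analyzer": {
                "first_unique_terms": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Order matters: de-duplicate first, then cut off.
                    "filter": ["lowercase", "unique", "max_token_count"],
                }
            },
        }
    }


settings = build_analysis_settings(1000)
print(json.dumps(settings, indent=2))
```

The printed JSON could be used as the "settings" body when creating an index or index template; it is worth checking the result with the _analyze API on a sample document before relying on it.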


You could maybe use the _analyze API as a step before creating the document.
Analyzing the text will produce all the tokens. Then you can do some json magic to group the tokens all together and keep only the ones you want...
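The "json magic" could look roughly like the sketch below. It assumes the usual _analyze response shape, a "tokens" array whose entries carry a "token" string, and it works on an already-fetched response so no cluster is needed. The function name `first_unique_tokens` and the sample data are made up for illustration.

```python
def first_unique_tokens(analyze_response, limit=1000):
    """Return the first `limit` distinct tokens, in order of appearance.

    `analyze_response` is assumed to be the parsed JSON body returned
    by the _analyze API: {"tokens": [{"token": "...", ...}, ...]}.
    """
    seen = set()
    kept = []
    for entry in analyze_response.get("tokens", []):
        if len(kept) >= limit:
            break  # cap reached, ignore everything after this point
        term = entry["token"]
        if term in seen:
            continue  # repeated term, does not count toward the cap
        seen.add(term)
        kept.append(term)
    return kept


# Hand-written sample standing in for a real _analyze response.
sample_response = {
    "tokens": [
        {"token": "the", "position": 0},
        {"token": "quick", "position": 1},
        {"token": "the", "position": 2},
        {"token": "fox", "position": 3},
    ]
}

print(first_unique_tokens(sample_response, limit=2))
```

The kept terms could then be joined back into a string and indexed into a separate field, for example one using a whitespace analyzer, while the original text stays in _source.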
