Is there a way to get a list of 'unique' words or terms used in a large text field across ALL documents? For example, if the text field contains 'It is the most confidential because confidential or transcendental knowledge involves understanding the difference between the soul and the body', then the output should be something like:
confidential, transcendental, knowledge, understanding, soul, body
Yes, but it’s not straightforward with a large text field.
Elasticsearch stores analyzed tokens in the inverted index, but it does not provide a simple way to retrieve a full global list of unique terms from all documents.
If you need something like this, a few common approaches are:

- Use the term vectors API to inspect the tokens generated for a field in a given document.
- Use the `_analyze` API to see how a piece of text is tokenized.
- If aggregation of unique values is required, store the terms in a separate field (`keyword` or normalized) during ingestion and run a `terms` aggregation on that field.
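To make the first two options concrete, here is a minimal sketch in Kibana Dev Tools syntax. The index name `my-index`, the field name `content`, and the document ID `1` are assumptions for illustration, not values from the original question:

```
# See how a given text would be tokenized by the field's analyzer
GET my-index/_analyze
{
  "field": "content",
  "text": "It is the most confidential because confidential or transcendental knowledge involves understanding the difference between the soul and the body"
}

# Inspect the tokens actually stored for one document's field
GET my-index/_termvectors/1
{
  "fields": ["content"]
}
```

Note that both calls are per-text or per-document; neither gives you a global list of unique terms across the whole index.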
In practice, if this is a recurring requirement, the best approach is usually to extract the relevant terms during ingestion and index them in a dedicated field so they can be aggregated efficiently.
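As a sketch of that ingestion-time approach: map a dedicated `keyword` field for the extracted terms, populate it when you index each document (how you extract the terms is up to your pipeline), and then aggregate on it. The index and field names below are hypothetical:

```
PUT my-index
{
  "mappings": {
    "properties": {
      "content":         { "type": "text" },
      "extracted_terms": { "type": "keyword" }
    }
  }
}

GET my-index/_search
{
  "size": 0,
  "aggs": {
    "unique_terms": {
      "terms": { "field": "extracted_terms", "size": 100 }
    }
  }
}
```

A `terms` aggregation on a `keyword` field is cheap and doesn't require loading text-field data into memory, which is why this is usually the preferred pattern for recurring requirements.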
Adding to what Rafa already said: you can get the most popular words from a random sample of content using something like this aggregation: Background stats query (extracts popular words and counts from a random sample for use as background in significant text analysis) · GitHub
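I haven't reproduced the linked gist here, but the general shape of a "popular words from a random sample" query combines a `sampler` aggregation with an aggregation over the text field, along these lines (index and field names are assumptions):

```
GET my-index/_search
{
  "size": 0,
  "aggs": {
    "sample": {
      "sampler": { "shard_size": 100 },
      "aggs": {
        "keywords": {
          "significant_text": { "field": "content" }
        }
      }
    }
  }
}
```

The `sampler` aggregation limits the analysis to the top matching documents per shard (`shard_size`), which is what keeps the memory cost bounded.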
It's too expensive to run this across many documents and can cause memory issues, so keep the sample sizes low.
Thank you so much, Rafa and Mark, for taking the time to share such detailed responses; I will review them. I'm afraid I'll have to go with some custom logic, since the requirements are getting more complex as I get into the details.