Is there a way to get a list of 'unique' words or terms used in a large text field across ALL documents? For example, if the text field contains 'It is the most confidential because confidential or transcendental knowledge involves understanding the difference between the soul and the body', then the output should be something like:
confidential, transcendental, knowledge, understanding, soul, body
Yes, but it’s not straightforward with a large text field.
Elasticsearch stores analyzed tokens in the inverted index, but it does not provide a simple way to retrieve a full global list of unique terms from all documents.
If you need something like this, a few common approaches are:

- Use the term vectors API to inspect the tokens generated for a field in a given document.
- Use the `_analyze` API to see how a piece of text is tokenized.
- If aggregation of unique values is required, store the terms in a separate field (`keyword` or normalized) during ingestion and run a `terms` aggregation on that field.
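To make the first two options concrete, here is a minimal sketch in Kibana Dev Tools syntax. The index name `my-index`, the field name `content`, and the document ID `1` are assumptions for illustration, not values from the original question:

```
# See how a given text would be tokenized by the field's analyzer
GET my-index/_analyze
{
  "field": "content",
  "text": "It is the most confidential because confidential or transcendental knowledge involves understanding the difference between the soul and the body"
}

# Inspect the tokens actually stored for one document's field
GET my-index/_termvectors/1
{
  "fields": ["content"]
}
```

Note that both calls are per-text or per-document; neither gives you a global list of unique terms across the whole index.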
In practice, if this is a recurring requirement, the best approach is usually to extract the relevant terms during ingestion and index them in a dedicated field so they can be aggregated efficiently.
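As a sketch of that ingestion-time approach: map a dedicated `keyword` field for the extracted terms, populate it when you index each document (how you extract the terms is up to your pipeline), and then aggregate on it. The index and field names below are hypothetical:

```
PUT my-index
{
  "mappings": {
    "properties": {
      "content":         { "type": "text" },
      "extracted_terms": { "type": "keyword" }
    }
  }
}

GET my-index/_search
{
  "size": 0,
  "aggs": {
    "unique_terms": {
      "terms": { "field": "extracted_terms", "size": 100 }
    }
  }
}
```

A `terms` aggregation on a `keyword` field is cheap and doesn't require loading text-field data into memory, which is why this is usually the preferred pattern for recurring requirements.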
Adding to what Rafa already said: you can get the most popular words from a random sample of content using something like this aggregation: Background stats query (extracts popular words and counts from a random sample for use as background in significant text analysis) · GitHub
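I haven't reproduced the linked gist here, but the general shape of a "popular words from a random sample" query combines a `sampler` aggregation with an aggregation over the text field, along these lines (index and field names are assumptions):

```
GET my-index/_search
{
  "size": 0,
  "aggs": {
    "sample": {
      "sampler": { "shard_size": 100 },
      "aggs": {
        "keywords": {
          "significant_text": { "field": "content" }
        }
      }
    }
  }
}
```

The `sampler` aggregation limits the analysis to the top matching documents per shard (`shard_size`), which is what keeps the memory cost bounded.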
It's too expensive to run this across many documents and can cause memory issues, so keep the sample sizes low.
Thank you so much, Rafa and Mark, for taking the time to share such detailed responses; I will review them. I'm afraid I'll have to go with some custom logic, since the requirements are getting more complex as I get into the details.