Terms aggregations - Getting a total of each word across all documents

markwalsh-liverpool · August 5, 2016, 12:16pm

Currently I am performing a terms aggregation like so:

"aggs": {
    "wordcloud": {
      "terms": {
        "field": "transcription.raw",
        "size": 40
      }
    }
}

I get back results that look like this (for example):

"aggregations": {
      "wordcloud": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "banana",
               "doc_count": 2
            },
            {
               "key": "apples",
               "doc_count": 1
            }
         ]
    }
}

This query counts the number of documents in which each word appears, but it does not count how many times that word occurs within the field. I want a count of how many times, across all documents, every word within the transcription.raw field occurs. Is this possible?

Thanks very much in advance

Mark_Harwood · August 5, 2016, 2:35pm

I want to get a count of how many times, across all documents, every word within the transcription.raw field occurs, is this possible?

That sounds very expensive to do on a big index. There's typically a very long tail of words that are only used once.

Let's back up a little first: what is the business requirement? I see you mention "word cloud" in your agg, and the results of what you are asking for will, at great expense, tell you something I can probably already guess: the word "the" is quite common. Typically, generating word clouds is an exercise in picking out the more interesting words in a set, so aggregations like significant_terms are useful for this (where significant != popular).

markwalsh-liverpool · August 5, 2016, 3:51pm

The business requirement is indeed for a word cloud, although it doesn't have to be particularly performant and our data set is quite small.

I actually cut down my aggregation for brevity; we have a list of excluded words which includes the word "the".

The problem in my specific scenario is that we don't know in advance which words the transcription.raw field will contain. We just need a count of each individual word regardless of whether it's interesting or not, so I don't think significant_terms is applicable to us.

Sherry_Ger · August 5, 2016, 4:04pm

Have a look at Term Vectors. It may provide the term stats you want.

Mark_Harwood · August 5, 2016, 4:08pm

we just need a count of each individual word

For the purposes of a word cloud, a straightforward doc count should normally be enough without having to count the frequencies of use inside each doc. A terms aggregation should give you that.

we have a list of excluded words which includes the word "the".

If it is free text, that tends to just be a sliding scale of English words though - "the, and, of, a, with, but, you, they, sometimes....". It's hard to draw a line on a popularity scale like this where the "interesting" words in your data suddenly appear.
The task of a word cloud is often to give an overview of what makes some content different from everyday English, e.g. today vs all other days. If you are experimenting with smaller indices you could afford to index some background text (e.g. a selection of English Wikipedia articles) to "diff" your content against. That should help tune in to the interesting words.
As an example, by indexing Gutenberg books [1] we can diff the chapters of one book against the backdrop of all other books, and the significant terms for "The War of the Worlds" are "martians, horsell, deathray" etc. Without a backdrop to compare your content against there is no baseline for effectively determining what is interesting.

[1] https://www.gutenberg.org/
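
A minimal sketch of what such a "diff" request body could look like, built as a plain Python dict. The field name book_title and the helper significant_terms_request are hypothetical and only there for illustration; transcription.raw and the agg name wordcloud come from the thread. It assumes the foreground set is selected by the query and the rest of the index acts as the backdrop.

```python
import json

def significant_terms_request(field, foreground_query, size=40):
    """Build a search body whose query selects the foreground documents.

    The significant_terms aggregation then compares term usage in those
    documents against the rest of the index (the backdrop).
    """
    return {
        "size": 0,  # only the aggregation is of interest, not the hits
        "query": foreground_query,
        "aggs": {
            "wordcloud": {
                "significant_terms": {"field": field, "size": size}
            }
        },
    }

body = significant_terms_request(
    "transcription.raw",
    # "book_title" is a hypothetical field used to pick one book
    {"term": {"book_title": "The War of the Worlds"}},
)
print(json.dumps(body, indent=2))
```

The printed JSON could then be sent to the _search endpoint of the index holding both the content and the background text.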

markwalsh-liverpool · August 5, 2016, 4:27pm

Is it possible to perform this on more than one document in one query?

markwalsh-liverpool · August 5, 2016, 4:31pm

A terms aggregation unfortunately only returns a count of the documents which contain a particular word.

So if I had two documents stored, each with a 'transcription.raw' field:

1st document's 'transcription.raw' field contained "apples apples apples"
2nd document's 'transcription.raw' field contained "apples"

The doc_count returned for "apples" would be 2, as "apples" appears in each of the two documents.

What I would like to know in this instance is how many times the word "apples" appears across all the documents (in this instance, it would be 4).
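
To make the two numbers concrete, here is a small standalone Python sketch (not Elasticsearch code) that computes both counts for the two example documents. It assumes simple whitespace tokenization, which may differ from how the field is actually analyzed.

```python
from collections import Counter

def doc_counts(docs):
    """Number of documents each word appears in (what a terms agg reports)."""
    counts = Counter()
    for text in docs:
        # set() so a word is counted at most once per document
        counts.update(set(text.split()))
    return counts

def total_counts(docs):
    """Number of times each word occurs across all documents."""
    counts = Counter()
    for text in docs:
        counts.update(text.split())
    return counts

docs = ["apples apples apples", "apples"]
print(doc_counts(docs)["apples"])    # 2
print(total_counts(docs)["apples"])  # 4
```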

Sherry_Ger · August 5, 2016, 4:51pm

You can try the Multi termvectors API, which can be run against a whole index. However, I am not sure if the sum aggregation will work.
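
A rough sketch of how the per-document term frequencies from a Multi termvectors response could be summed on the client side. The helper names and the sample response are made up for illustration; the sample only mimics the general shape of a response (docs, term_vectors, terms, term_freq) and is not real output. As far as I know the request needs explicit document IDs, so these would have to be collected first.

```python
from collections import Counter

def build_mtermvectors_body(doc_ids, field="transcription.raw"):
    """Request body for POST /<index>/_mtermvectors for the given doc IDs."""
    return {
        "ids": list(doc_ids),
        "parameters": {
            "fields": [field],
            "positions": False,  # only frequencies are needed here
            "offsets": False,
        },
    }

def sum_term_freqs(response, field="transcription.raw"):
    """Add up term_freq per term over every document in the response."""
    totals = Counter()
    for doc in response.get("docs", []):
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        for term, stats in terms.items():
            totals[term] += stats["term_freq"]
    return totals

# Hand-written sample in the shape of a response, for illustration only
sample_response = {
    "docs": [
        {"_id": "1", "term_vectors": {"transcription.raw": {
            "terms": {"apples": {"term_freq": 3}}}}},
        {"_id": "2", "term_vectors": {"transcription.raw": {
            "terms": {"apples": {"term_freq": 1}}}}},
    ]
}

totals = sum_term_freqs(sample_response)
print(totals["apples"])  # 4
```

For a small data set this client-side sum may be acceptable; on a large index fetching term vectors for every document is likely to be slow.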

Mark_Harwood · August 5, 2016, 5:03pm

Understood, but for Word-cloud purposes on a collection the in-document repetition tends to be less useful and is balanced out across docs - the words that occur more frequently within docs correspondingly have a higher probability of appearing across docs.

For the record, I just re-ran the exercise of diffing one book against a backdrop of other books to get the interesting words. Here's the Graph plugin showing significant terms for 3 books compared to a backdrop of 12 books:

.. and here's the same analysis if you use the most popular terms instead of looking for significant ones:

Hopefully this helps illustrate the difference in quality
