Currently I am performing a terms aggregation like so:
"aggs": {
"wordcloud": {
"terms": {
"field": "transcription.raw",
"size": 40,
}
}
}
I get back results that look like this (for example):
"aggregations": {
"wordAppearences": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "banana",
"doc_count": 2
},
{
"key": "apples",
"doc_count": 1
}
]
}
}
This query counts the number of documents in which each word appears, but it does not count how many times each word occurs within that field. I want to get a count of how many times, across all documents, every word within the transcription.raw field occurs. Is this possible?
Thanks very much in advance
I want to get a count of how many times, across all documents, every word within the transcription.raw field occurs. Is this possible?
That sounds very expensive to do on a big index. There's typically a very long tail of words that are only used once.
Let's back up a little first: what is the business requirement? I see you mention "word cloud" in your agg, and the results of what you are asking for will, at great expense, tell you something I can probably already guess: the word "the" is quite common. Typically, generating word clouds is an exercise in picking out the more interesting words in a set, so aggregations like significant_terms are useful for this (where significant != popular).
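For illustration, a significant_terms version of your aggregation might look something like this (same field as in your example; the size is arbitrary):

"aggs": {
  "wordcloud": {
    "significant_terms": {
      "field": "transcription.raw",
      "size": 40
    }
  }
}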
The business requirement is indeed for a word cloud, although it doesn't have to be particularly performant and our data set is quite small.
I actually cut down my aggregation for brevity; we have a list of excluded words which includes the word "the".
The problem in my specific scenario is that we don't actually know what words the transcription.raw field will contain; we just need a count of each individual word, regardless of whether it's interesting or not, so I don't think significant_terms is applicable to us.
Have a look at Term Vectors. It may provide the term stats you want.
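For example, a single-document request with term statistics enabled might look roughly like this (my_index and the document ID are placeholders, and the exact endpoint path varies a little between versions). With "term_statistics": true the response includes ttf, the total number of times each term occurs in that field, although note that those statistics come from the shard the document lives on rather than the whole index:

GET /my_index/_termvectors/1
{
  "fields": ["transcription.raw"],
  "term_statistics": true,
  "field_statistics": true
}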
we just need a count of each individual word
For the purposes of a word cloud, a straightforward doc count should normally be enough, without having to count the frequencies of use inside each doc. A terms aggregation should give you that.
we have a list of excluded words which includes the word "the".
If it is free text, that tends to just be a sliding scale of English words though: "the, and, of, a, with, but, you, they, sometimes...". It's hard to draw a line on a popularity scale like this where the "interesting" words in your data suddenly appear.
The task of a word cloud is often to give an overview of what makes some content different from everyday English, e.g. today vs. all other days. If you are experimenting with smaller indices, you could afford to index some background text (e.g. a selection of English Wikipedia articles) to "diff" your content against; that should help tune in to the interesting words.
As an example, by indexing Gutenberg books [1] we can diff the chapters of one book against the backdrop of all other books, and the significant terms for "War of the Worlds" are "martians, horsell, deathray", etc. Without a backdrop to compare your content against, there is no baseline for effectively determining what is interesting.
[1] https://www.gutenberg.org/
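As a rough sketch of that kind of "diff" (the index and field names here are made up for illustration), you query for the foreground set and significant_terms compares it against the rest of the index as the background:

GET /books/_search
{
  "query": {
    "match": { "title": "war of the worlds" }
  },
  "size": 0,
  "aggs": {
    "interesting_words": {
      "significant_terms": {
        "field": "text",
        "size": 20
      }
    }
  }
}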
Is it possible to perform this on more than one document in one query?
A terms aggregation unfortunately only returns a count of the documents which contain a particular word.
So if I had two documents stored, each with a 'transcription.raw' field:
1st document's 'transcription.raw' field contained "apples apples apples"
2nd document's 'transcription.raw' field contained "apples"
The count returned for "apples" would be 2, as "apples" appears in each document.
What I would like to know is how many times, across all the documents, the word "apples" appears (in this instance, it would be 4).
You can try the Multi termvectors API, which can be run against a whole index. However, I am not sure if the sum aggregation will work.
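A multi-document request might look something like this (the document IDs are placeholders); as far as I know you would still need to sum the per-term term_freq values across documents on the client side:

POST /my_index/_mtermvectors
{
  "ids": ["1", "2"],
  "parameters": {
    "fields": ["transcription.raw"],
    "term_statistics": true
  }
}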
Understood, but for word-cloud purposes on a collection, in-document repetition tends to be less useful and is balanced out across docs: the words that occur more frequently within docs correspondingly have a higher probability of appearing across docs.
For the record, I just re-ran the exercise of diffing one book against a backdrop of other books to get the interesting words. Here's the Graph plugin showing significant terms for 3 books compared to a backdrop of 12 books:
... and here's the same analysis if you use the most popular terms instead of looking for significant ones:
Hopefully this helps illustrate the difference in quality.