I am wondering if it is possible at all to get the top ten most frequent
words in an Elasticsearch field across an entire index or alias.
Here is what I'm trying to do:
I am indexing text documents extracted from various document types (Word,
PowerPoint, PDF, etc.). These are analyzed and stored in a field called
doc_content. I would like to know if there is a way to find the most
frequent word(s) stored in the doc_content field of a particular index.
To make it clearer, let's assume I am indexing invoices from Amazon and eBay,
for example. Now let's assume I have 100 invoices from Amazon and 20
invoices from eBay. Let's also assume that the word "amazon" occurs twice in
each Amazon invoice and the word "ebay" occurs 3 times in each eBay invoice.
Now, is there a way to get an aggregation of some sort that tells me that the
word "amazon" appears in my index 200 times (100 invoices x 2
occurrences/invoice) and the word "ebay" occurs 60 times (20 invoices x 3
occurrences/invoice)?
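Something like the following is what I have in mind (the index name "invoices" is just an example, and I understand this would require fielddata to be enabled on the doc_content text field):

```
GET /invoices/_search
{
  "size": 0,
  "aggs": {
    "top_words": {
      "terms": {
        "field": "doc_content",
        "size": 10
      }
    }
  }
}
```

From what I can tell, though, the doc_count in a terms aggregation would be 100 for "amazon" (the number of documents containing the term), not the 200 total occurrences I'm after, which is part of why I'm asking.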
My other question: if the former is possible, is there also a way to
determine the most frequent word that comes after a certain word?
For example, let's assume I have 100 documents: 60 of them contain the term
"Old Cat" and 40 contain the term "Old Dog", and for the sake of argument
let's assume these terms appear only once in each document.
Now, we can get the frequency of the word "old", which in our case should
be 100. Can we then relate it to the word that comes right after it, to get
something like this:
          ___________ Cat (60)
Old (100) |
          |__________ Dog (40)
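What I'm imagining (and I don't know if this is the right approach) is indexing word pairs with a shingle token filter into a bigram subfield, then running a terms aggregation filtered to pairs starting with "old". The field and filter names below are hypothetical, and the exact mapping syntax depends on the Elasticsearch version:

```
PUT /docs
{
  "settings": {
    "analysis": {
      "filter": {
        "bigram_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": false
        }
      },
      "analyzer": {
        "bigrams": {
          "tokenizer": "standard",
          "filter": ["lowercase", "bigram_filter"]
        }
      }
    }
  }
}
```

With doc_content analyzed into a doc_content.bigrams subfield using that analyzer (and fielddata enabled on it), the aggregation might look like:

```
GET /docs/_search
{
  "size": 0,
  "aggs": {
    "after_old": {
      "terms": {
        "field": "doc_content.bigrams",
        "include": "old .*",
        "size": 10
      }
    }
  }
}
```

But again, I believe this would count documents rather than total occurrences, so I'm not sure it fully answers the frequency question.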