Elasticsearch word frequency and relations


(Zaid Amir) #1

Hi,

I am wondering if it is possible at all to get the top ten most frequent
words in an Elasticsearch field across an entire index or alias.

Here is what I'm trying to do:

I am indexing text extracted from various document types (Word,
PowerPoint, PDF, etc.). The text is analyzed and stored in a field called
doc_content. I would like to know if there is a way to find the most
frequent word(s) stored in the doc_content field of a particular index.

To make it clearer, let's assume I am indexing invoices from Amazon and
eBay. Say I have 100 invoices from Amazon and 20 invoices from eBay, the
word "amazon" occurs twice in each Amazon invoice, and the word "ebay"
occurs 3 times in each eBay invoice.

Now, is there a way to get an aggregate of sorts that tells me that the
word "amazon" appears in my index 200 times (100 invoices x 2
occurrences/invoice) and the word "ebay" appears 60 times (20 invoices x 3
occurrences/invoice)?
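
To make the desired numbers concrete, here is a minimal Python sketch of the count being asked for. It is plain Python rather than an Elasticsearch query, the invoice texts are made-up stand-ins, and the simple lowercase `\w+` tokenizer is an assumption that will not match whatever analyzer is configured on doc_content.

```python
import re
from collections import Counter


def total_term_frequency(docs):
    """Count every occurrence of every word across all documents."""
    counts = Counter()
    for text in docs:
        # Assumed tokenizer: lowercase, split on non-word characters.
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


# Hypothetical invoice texts: "amazon" twice, "ebay" three times.
amazon_invoice = "Amazon order receipt. Sold by Amazon."
ebay_invoice = "eBay purchase: eBay seller shipped via eBay."
docs = [amazon_invoice] * 100 + [ebay_invoice] * 20

counts = total_term_frequency(docs)
print(counts["amazon"], counts["ebay"])  # 200 60
print(counts.most_common(10))            # the ten most frequent words
```

Note that this counts occurrences, not the number of documents containing the word, which is the distinction the 200 vs. 100 example is getting at.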

My other question: if the former is possible, is there a way to
determine the most frequent word that comes after a certain word?

For example, let's assume I have 100 documents. 60 of these documents
contain the phrase "Old Cat" and 40 contain the phrase "Old Dog", and for
the sake of argument let's assume that each phrase appears only once per
document.

Now, if we can get the frequency of the word "old", which in our case
should be 100, can we then relate it to the word that comes right after
it, to get something like this:

          __________ Cat (60)
          |
Old (100) |
          |__________ Dog (40)
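
The relation in the diagram above can be sketched in plain Python as a count of the words that immediately follow a target word. This is only an illustration of the desired output, not an Elasticsearch feature; the sample sentences and the `following_words` helper are hypothetical, and the tokenizer is the same simple lowercase `\w+` split assumed earlier.

```python
import re
from collections import Counter


def following_words(docs, target):
    """Count the words that appear directly after `target`."""
    target = target.lower()
    followers = Counter()
    for text in docs:
        tokens = re.findall(r"\w+", text.lower())
        # Walk over adjacent token pairs (word, next word).
        for current, nxt in zip(tokens, tokens[1:]):
            if current == target:
                followers[nxt] += 1
    return followers


# Hypothetical corpus: 60 documents with "old cat", 40 with "old dog".
docs = ["The old cat sleeps."] * 60 + ["An old dog barks."] * 40

followers = following_words(docs, "old")
print(sum(followers.values()))   # 100
print(followers.most_common())   # [('cat', 60), ('dog', 40)]
```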

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/b8056758-902f-4361-bb60-a8930aaa9725%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(e.moeller) #2

Similar question to Zaid's first question: keyword extraction along TF-IDF
lines. Specifically, I have a corpus of ~10K articles and am looking to
get a ranking of all the tokenized terms in each article based on their
frequency in the article and each term's relative frequency across the
corpus. Thanks!
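
For reference, the kind of ranking described here might look like the following plain-Python sketch. It assumes one common TF-IDF variant (term frequency normalized by article length, times log(N / document frequency)); other weightings exist, and nothing here reflects how Elasticsearch scores documents internally. The three sample articles and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase, split on non-word characters.
    return re.findall(r"\w+", text.lower())


def tfidf_rankings(articles):
    """Return, per article, its terms sorted by descending TF-IDF score."""
    tokenized = [tokenize(article) for article in articles]
    n_docs = len(tokenized)

    # Document frequency: in how many articles does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    rankings = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        }
        rankings.append(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
    return rankings


articles = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
rankings = tfidf_rankings(articles)
for ranked in rankings:
    print(ranked[:3])  # top three terms per article
```

A term such as "the" that occurs in every article gets a score of zero under this weighting, which is the corpus-relative effect being asked about.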

(In reply to Zaid Amir's post of Sunday, May 3, 2015 at 3:49:07 AM UTC-4.)

