Say I have a field called "webpages" containing an array of plain-text webpage content.
Through Elasticsearch, is there a way I can have it output the top 1 million six-word n-gram phrases ranked by term frequency? Or maybe the top 3 million two-word n-gram phrases?
By frequency, I'm referring to the number of times the n-gram appears in the "webpages" array across the whole index.
I'm thinking ES may already compute this information in advance for its tf-idf calculations; it would be useful if I could export it and save it to a text file in a reasonable amount of time.
This would really save me time because all relevant data is already stored in one centralized place - Elasticsearch. Thanks in advance!
So this is actually fairly tricky. Elasticsearch has the information, as you said, but it's not available for easy retrieval. Term and doc frequencies are stored on a per-segment basis to facilitate distributed search. There's no global registry of all the term/doc frequencies; to get that information we'd have to compile the dictionaries from every segment across the cluster, then dedupe, aggregate, and order them, which is fairly expensive.
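For what it's worth, you can see those per-shard statistics through the term vectors API, one document at a time, which is also why there's no cheap global view. A minimal Python sketch with the official client (the index name, doc id, and cluster URL are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Fetch term statistics for one document's "webpages" field.
# ttf (total term frequency) and doc_freq come from the shard
# holding this document, not from any cluster-wide dictionary.
resp = es.termvectors(
    index="my_index",   # hypothetical index name
    id="1",             # hypothetical document id
    fields=["webpages"],
    term_statistics=True,
    field_statistics=True,
)

for term, stats in resp["term_vectors"]["webpages"]["terms"].items():
    print(term, stats.get("ttf"), stats.get("doc_freq"))
```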
More traditional aggregations like the terms agg won't work either: trying to sort a few million n-grams in one request will quickly push your nodes into out-of-memory failures.
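If an offline export is acceptable, one workaround (a sketch under assumptions, not something your current mapping supports out of the box) is to index word shingles and then page through them with a composite aggregation, which streams buckets back in key order instead of sorting the whole term set at once. The ngram_index name and six_gram analyzer below are made up for illustration:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Hypothetical index whose analyzer emits six-word shingles.
# fielddata=true makes the text field aggregatable, at a real heap
# cost; test on a small index before trying this at scale.
es.indices.create(
    index="ngram_index",
    settings={
        "analysis": {
            "filter": {
                "six_word_shingles": {
                    "type": "shingle",
                    "min_shingle_size": 6,
                    "max_shingle_size": 6,
                    "output_unigrams": False,
                }
            },
            "analyzer": {
                "six_gram": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "six_word_shingles"],
                }
            },
        }
    },
    mappings={
        "properties": {
            "webpages": {
                "type": "text",
                "analyzer": "six_gram",
                "fielddata": True,
            }
        }
    },
)

# ...reindex the webpage text into ngram_index, then page through
# every shingle. Composite buckets come back in key order, so no
# single node has to sort the full term set in memory.
after_key = None
while True:
    composite = {
        "size": 1000,
        "sources": [{"gram": {"terms": {"field": "webpages"}}}],
    }
    if after_key:
        composite["after"] = after_key
    resp = es.search(index="ngram_index", size=0,
                     aggs={"grams": {"composite": composite}})
    agg = resp["aggregations"]["grams"]
    for bucket in agg["buckets"]:
        # doc_count is document frequency, not total term frequency.
        print(bucket["key"]["gram"], bucket["doc_count"])
    after_key = agg.get("after_key")
    if not after_key:
        break
```

One caveat with this approach: the composite agg's doc_count is document frequency, not the total number of occurrences, so if you strictly need term frequency you'd have to fall back on term vectors or count the n-grams outside Elasticsearch. Either way, sorting the dumped file to get your "top N" list is best done offline.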