Output n-gram frequency distributions in Elasticsearch?

(Hacker 21) #1

Say I have a field called "webpages" with an array of webpage plain-text content.

Through Elasticsearch, is there a way I can have it output the top 1 million six-word n-gram phrases based on term frequency? Or maybe the top 3 million two-word n-gram phrases?

By frequency, I'm referring to the number of times the n-gram appears in the "webpages" array across the whole index.

I'm thinking ES may already compute this information in advance for tf-idf scoring; it would be useful if I could output it and save it to a text file in a reasonable amount of time.

This would really save me time because all relevant data is already stored in one centralized place - Elasticsearch. Thanks in advance!

(Zachary Tong) #2

So this is actually fairly tricky. Elasticsearch has the information as you said, but it's not available for easy retrieval. Term and doc frequencies are stored on a per-segment basis to facilitate distributed search. There's no global registry of all the term/doc frequencies... to get that information we'd have to compile the term dictionaries from all the segments across the cluster, then dedupe, aggregate, and sort. Which is reasonably expensive :slight_smile:
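One prerequisite worth noting: Elasticsearch only tracks statistics for the terms you actually index, so to count word n-grams at all you'd index them with a shingle token filter. A minimal sketch of such an index definition, expressed as a Python dict (the field name "webpages" comes from the question; the filter/analyzer names are placeholders):

```python
# Hypothetical index settings for indexing 2-word shingles as terms.
# Filter and analyzer names ("shingle_2", "shingle_analyzer") are made up.
two_word_shingles = {
    "settings": {
        "analysis": {
            "filter": {
                "shingle_2": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 2,
                    "output_unigrams": False,  # emit only the 2-word shingles
                }
            },
            "analyzer": {
                "shingle_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "shingle_2"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "webpages": {"type": "text", "analyzer": "shingle_analyzer"}
        }
    },
}
```

For six-word phrases you'd set `min_shingle_size` and `max_shingle_size` to 6 instead, at the cost of a much larger term dictionary.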

There's a Term Vectors API (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html) which returns some of the information you're interested in, but the API works on a per-document basis. It's done this way because the information is expensive to retrieve, so it can only be pulled out doc-by-doc.
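For illustration, here's a sketch of a Term Vectors request body as a Python dict (you'd send it to `GET /<index>/_termvectors/<doc_id>`, one document at a time). Setting `term_statistics` asks for total term frequency and document frequency alongside the per-document counts:

```python
def termvectors_request(field):
    """Build a hypothetical body for GET /<index>/_termvectors/<doc_id>.

    The response's term_vectors.<field>.terms maps each term to its stats.
    """
    return {
        "fields": [field],
        "term_statistics": True,   # include ttf / doc_freq per term
        "field_statistics": True,  # include sum_ttf etc. for the field
        "positions": False,        # skip positional data we don't need
        "offsets": False,
    }

body = termvectors_request("webpages")
```

Note the statistics returned are still per-shard, not cluster-wide, which is part of why this API alone won't give you a global top-N list.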

I think the best option for you is to run a composite aggregation (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html) to collect all the ngrams, then order by count client-side. The composite agg is designed to be a memory-friendly streaming aggregation, but for that reason it lacks some niceties like sorting by count.

More traditional aggregations like the terms agg won't work here because trying to sort a few million ngrams in one request will quickly break your nodes with out-of-memory errors.
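The client-side ordering step, by contrast, is cheap: with a heap you keep only the top k entries in memory rather than materializing a full sort of millions of n-grams. A sketch:

```python
import heapq

def top_ngrams(counts, k):
    """Return the k highest-count (ngram, count) pairs, descending.

    heapq.nlargest maintains a k-sized heap, so memory stays O(k)
    even when `counts` holds millions of entries.
    """
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

# e.g. top_ngrams({"a b": 5, "b c": 2, "c d": 9}, 2)
# -> [("c d", 9), ("a b", 5)]
```

From there, writing the result to a text file is a plain loop over the returned pairs.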

(system) #3

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.