Output n-gram frequency distributions in Elasticsearch?

hacker_21 · October 27, 2018, 7:17pm

Say I have a field called "webpages" with an array of webpage plain-text content.

Through Elasticsearch, is there a way to output the top 1 million 6-word n-grams, ranked by term frequency? Or maybe the top 3 million 2-word n-grams?

By frequency, I'm referring to the number of times the n-gram appears in the "webpages" array across the whole index.

I'm thinking ES may already compute this information in advance for tf-idf scoring. It would be useful if I could output it and save it to a text file in a reasonable amount of time.

This would really save me time because all relevant data is already stored in one centralized place - Elasticsearch. Thanks in advance!

polyfractal · October 29, 2018, 11:33am

So this is actually fairly tricky. Elasticsearch has the information, as you said, but it's not available for easy retrieval. Term and doc frequencies are stored on a per-segment basis to facilitate distributed search. There's no global registry of all the term/doc frequencies; to get that information we'd have to compile the dictionaries from all the segments across the cluster, dedupe, aggregate, and order them, which is reasonably expensive.
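To make the "compile, dedupe, aggregate and order" step concrete, here is a toy sketch in Python. It assumes each segment's dictionary can be represented as a plain mapping from term to frequency, which is a simplification for illustration and not how Lucene actually stores or exposes its term dictionaries. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def merge_segment_frequencies(segments: Iterable[Dict[str, int]]) -> Counter:
    """Sum per-segment term frequencies into one global table (toy model)."""
    total: Counter = Counter()
    for segment in segments:
        # The same term can appear in many segments, so counts are added
        # together rather than overwritten.
        total.update(segment)
    return total


def top_terms(segments: Iterable[Dict[str, int]], n: int) -> List[Tuple[str, int]]:
    """Return the n most frequent terms across all segments."""
    return merge_segment_frequencies(segments).most_common(n)
```

The cost comes from the merged table: it has to hold every distinct term from every segment before anything can be ordered, and with millions of distinct n-grams that table gets large.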

There's a Term Vectors API (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html) which returns some of the information you're interested in, but the API works on a per-document basis. It's done this way because the information is expensive to retrieve, so we only allow it to be pulled out doc-by-doc.
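As a rough illustration of what "doc-by-doc" means in practice, the sketch below builds a term vectors request body and folds one response at a time into a running total. It assumes the documented response shape (`term_vectors` → field → `terms` → `term_freq`) and leaves the HTTP call itself out; the helper names are hypothetical, and the field you pass would have to be one whose analyzer emits n-grams (for example via a shingle filter).

```python
from collections import Counter
from typing import Any, Dict


def termvectors_body(field: str) -> Dict[str, Any]:
    """Request body for the term vectors API, asking only for frequencies."""
    return {
        "fields": [field],
        "term_statistics": True,    # also return doc_freq / ttf per term
        "field_statistics": False,
        "positions": False,
        "offsets": False,
        "payloads": False,
    }


def add_term_freqs(total: Counter, response: Dict[str, Any], field: str) -> Counter:
    """Add one document's term frequencies for `field` to a running total."""
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    for term, stats in terms.items():
        total[term] += stats["term_freq"]
    return total
```

You would need one request per document, so for a large index this is likely to be slow.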

I think the best option for you is to run a composite aggregation (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-composite-aggregation.html) to collect all the ngrams, then order by count client-side. Composite agg is designed to be a memory friendly streaming aggregation, but lacks some niceties like sorting by count for that reason.

More traditional aggregations like the terms agg won't work, because trying to sort a few million n-grams will quickly break your nodes with out-of-memory errors.
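A minimal sketch of the composite aggregation approach, in Python, might look like the following. It pages through the buckets using the `after` key and keeps only the top N client-side. Several things here are assumptions rather than part of the advice above: the n-grams are assumed to already be indexed as individual terms in a field the aggregation can read, the source name `ngram` and the aggregation name `ngrams` are arbitrary, and `search` is any callable you supply that sends a request body to your index and returns the parsed JSON response. Note also that each bucket's `doc_count` is the number of documents containing the n-gram, not the total number of occurrences.

```python
import heapq
from typing import Any, Callable, Dict, Iterator, List, Optional, Tuple

# Hypothetical transport: takes a search body, returns the response as a dict.
SearchFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def composite_body(field: str, page_size: int,
                   after: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build one page of a composite aggregation over `field`."""
    composite: Dict[str, Any] = {
        "size": page_size,
        "sources": [{"ngram": {"terms": {"field": field}}}],
    }
    if after is not None:
        composite["after"] = after  # resume after the last bucket seen
    return {"size": 0, "aggs": {"ngrams": {"composite": composite}}}


def iter_ngram_counts(search: SearchFn, field: str,
                      page_size: int = 1000) -> Iterator[Tuple[str, int]]:
    """Stream (ngram, doc_count) pairs, one composite page at a time."""
    after: Optional[Dict[str, Any]] = None
    while True:
        agg = search(composite_body(field, page_size, after))["aggregations"]["ngrams"]
        buckets = agg["buckets"]
        if not buckets:
            return  # an empty page means every bucket has been consumed
        for bucket in buckets:
            yield bucket["key"]["ngram"], bucket["doc_count"]
        # Fall back to the last bucket's key if `after_key` is not returned.
        after = agg.get("after_key", buckets[-1]["key"])


def top_ngrams(search: SearchFn, field: str, n: int,
               page_size: int = 1000) -> List[Tuple[str, int]]:
    """Order by count client-side, keeping only the n largest in memory."""
    return heapq.nlargest(n, iter_ngram_counts(search, field, page_size),
                          key=lambda pair: pair[1])
```

With the official Python client, `search` could be something like `lambda body: es.search(index="my-index", body=body)`, where the index name is a placeholder and the exact call signature depends on your client version.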

system · November 26, 2018, 11:33am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
