Terms stats API

Shai_Erera · November 29, 2016, 3:24pm

Hi,

I've been reading about and using the Field stats API (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-field-stats.html) to retrieve field-level statistics.

I couldn't find a similar term-level API, i.e. one that exposes the statistics Lucene records at the term level (TermsEnum). Is there one that I'm missing? If not, is there another way to retrieve them?

I found this project on GitHub: https://github.com/jprante/elasticsearch-index-termlist. However, I currently have a requirement to use ES without plug-ins, so I was hoping that either I'm missing the API or there's a way to retrieve the stats via another API. (I was thinking maybe aggregations, but I'm not yet sure that will work; I'm trying it in parallel to posting here.)

If there isn't such an API, is it possible to add one? From a complexity standpoint, it should be nearly identical to the field-stats API, since all the stats are already recorded in Lucene.

Thanks in advance,
Shai

Shai_Erera · November 29, 2016, 3:50pm

The statistic I'm interested in is TermsEnum.totalTermFreq(), which returns the total number of occurrences of a term across all documents.

I don't think aggregations can help me here, at least not the ones I tried. E.g. when I tried the Terms aggregation, it returned the count of documents associated with a term. I also tried using a query to restrict the matching docs to the term(s) I'm interested in, but this gives me a different statistic (TermsEnum.docFreq()). Anyway, I assume using aggregations for this task is heavy.

Is there another aggregation type that I can use to count the total term occurrences?
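
To make the difference between the two statistics above concrete, here is a minimal Python sketch on a made-up, pre-tokenized corpus. The corpus and the helper names `doc_freq` and `total_term_freq` are illustrative assumptions; they only mimic what TermsEnum.docFreq() and TermsEnum.totalTermFreq() report, and are not Lucene or ES calls.

```python
from collections import Counter

# Toy corpus, already tokenized; purely illustrative data.
docs = [
    ["quick", "brown", "fox", "quick"],
    ["lazy", "dog"],
    ["quick", "dog", "dog", "dog"],
]

def doc_freq(term, docs):
    """Number of documents that contain the term at least once."""
    return sum(1 for doc in docs if term in doc)

def total_term_freq(term, docs):
    """Total number of occurrences of the term across all documents."""
    return sum(Counter(doc)[term] for doc in docs)

for term in ("quick", "dog"):
    print(term, doc_freq(term, docs), total_term_freq(term, docs))
```

A terms aggregation counts documents per term, so it corresponds to the first number; the statistic I'm after is the second.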

Shai_Erera · November 30, 2016, 8:17am

If I were to add this API to ES code, what's the best way to go about it? Do I need to fork the GitHub project and then submit a PR, or is it possible to push such changes into a branch in the ES repo?

Is it something that ES is interested in having at all?

Shai_Erera · November 30, 2016, 12:02pm

While I still think this is a useful API to have (and it also augments/completes the _field_stats API), I found a way to achieve that, using "artificial documents" (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html#docs-termvectors-artificial-doc).

This is a really cool feature I must say :). Anyway, it allows me to pass the terms whose statistics I'm interested in as the artificial document. I've tried it on a sample index I created and it worked.
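
For anyone who wants to try the same approach, here is a rough Python sketch of the idea, assuming the request and response shapes described in the term vectors docs: a `doc` object carrying the artificial document, `term_statistics` set to true, and a response with `term_vectors.<field>.terms.<term>.ttf`. The helper names, the field name `text` and the sample response are made up for illustration; no HTTP call is made here.

```python
import json

def build_termvectors_body(field, terms):
    """Build a _termvectors request body whose artificial document
    contains only the terms of interest (hypothetical helper)."""
    return {
        "doc": {field: " ".join(terms)},
        "fields": [field],
        "term_statistics": True,    # ask for doc_freq and ttf per term
        "field_statistics": False,
        "positions": False,
        "offsets": False,
        "payloads": False,
    }

def extract_ttf(response, field):
    """Map each returned term to its 'ttf'; terms without one are skipped."""
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    return {term: stats["ttf"] for term, stats in terms.items() if "ttf" in stats}

# Hand-written sample response, shaped like the documented output.
sample_response = {
    "term_vectors": {
        "text": {
            "terms": {
                "quick": {"doc_freq": 2, "ttf": 3, "term_freq": 1},
                "dog": {"doc_freq": 2, "ttf": 4, "term_freq": 1},
                "unseen": {"term_freq": 1},  # no index-level stats returned
            }
        }
    }
}

print(json.dumps(build_termvectors_body("text", ["quick", "dog"])))
print(extract_ttf(sample_response, "text"))
```

Note that the artificial document goes through the field's analyzer, so the term keys in the response may not match the input strings exactly.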

Still, if there's interest in having a terms statistics API in ES, I'll be happy to take on the challenge of adding it.

Shai_Erera · November 30, 2016, 1:59pm

Wanted to give another update: I missed the part in the docs that says that for an artificial document a random shard is chosen to service the request. So it looks like I need to query all of the index's shards, then sum the 'ttf' of each term across them.
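
The summing step could look roughly like the Python sketch below. It assumes one term vectors response has already been collected per shard (how to direct a request at a specific shard is left out here), and that each response has the `term_vectors.<field>.terms.<term>.ttf` shape. The function name and the sample data are hypothetical.

```python
from collections import Counter

def sum_ttf_across_shards(shard_responses, field):
    """Add up each term's 'ttf' over a list of per-shard responses.
    A term with no 'ttf' on a shard contributes 0 for that shard."""
    totals = Counter()
    for response in shard_responses:
        terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
        for term, stats in terms.items():
            totals[term] += stats.get("ttf", 0)
    return dict(totals)

# Made-up responses from two shards of the same index.
shard_0 = {"term_vectors": {"text": {"terms": {
    "quick": {"doc_freq": 1, "ttf": 2, "term_freq": 1},
    "dog": {"term_freq": 1},  # term not present on this shard
}}}}
shard_1 = {"term_vectors": {"text": {"terms": {
    "quick": {"doc_freq": 1, "ttf": 1, "term_freq": 1},
    "dog": {"doc_freq": 2, "ttf": 4, "term_freq": 1},
}}}}

print(sum_ttf_across_shards([shard_0, shard_1], "text"))
```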

