Terms stats API


(Shai Erera) #1

Hi,

I've been reading about and using the Field stats API (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-field-stats.html) to retrieve field-level statistics.

I couldn't find a similar term-level API, i.e. one that exposes the term-level stats that Lucene already records (TermsEnum). Is there one that I'm missing? If not, is there another way to retrieve them?

I found this project on GitHub: https://github.com/jprante/elasticsearch-index-termlist. However, I currently have a requirement to use ES without plug-ins, so I was hoping that either I'm missing the API or there's a way to retrieve the stats via another API. (I was thinking maybe aggregations, but I'm not yet sure that will work; I'm trying it in parallel to posting here.)

If there isn't such an API, would it be possible to add one? From a complexity standpoint, it should be nearly identical to the field-stats API, since all the stats are already recorded in Lucene.

Thanks in advance,
Shai


(Shai Erera) #2

The statistic I'm interested in is TermsEnum.totalTermFreq(), which returns the total number of occurrences of a term across all documents.

I don't think aggregations can help me here, at least not the ones I tried. E.g. when I tried the Terms aggregation, it returned the number of documents associated with each term. I also tried using a query to restrict the matching docs to the term(s) I'm interested in, but this gives me a different statistic (TermsEnum.docFreq()). In any case, I assume using aggregations for this task is heavy.
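
To make the difference between the two statistics concrete, here is a minimal sketch in plain Python. It does not call Lucene or ES; the `term_stats` helper and the whitespace tokenization are assumptions made purely for illustration. It counts, for one term, the number of documents containing it (the analogue of docFreq) and the total number of occurrences (the analogue of totalTermFreq).

```python
from collections import Counter


def term_stats(docs, term):
    """Return (doc_freq, total_term_freq) for `term`.

    Hypothetical helper: documents are lowercased and split on whitespace,
    which only approximates what a real analyzer would do.
    """
    doc_freq = 0
    total_term_freq = 0
    for text in docs:
        counts = Counter(text.lower().split())
        if counts[term] > 0:
            doc_freq += 1                    # the document contains the term
            total_term_freq += counts[term]  # every occurrence is counted
    return doc_freq, total_term_freq


docs = ["red red blue", "blue green", "red"]
print(term_stats(docs, "red"))  # (2, 3)
```

A document count like the one the Terms aggregation returned corresponds to the first number, while the statistic asked about here is the second.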

Is there another aggregation type that I can use to count the total term occurrences?


(Shai Erera) #3

If I were to add this API to ES code, what's the best way to go about it? Do I need to fork the GitHub project and then submit a PR, or is it possible to push such changes into a branch in the ES repo?

Is it something that ES is interested in having at all?


(Shai Erera) #4

While I still think this is a useful API to have (and it also augments/completes the _field_stats API), I found a way to achieve that, using "artificial documents" (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html#docs-termvectors-artificial-doc).

This is a really cool feature I must say :). Anyway, it allows me to pass the terms whose statistics I'm interested in as the artificial document. I've tried it on a sample index I created and it worked.
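
For anyone trying the same thing, a rough sketch of the request body and of reading the result follows. The helper names, the field name `text` and the sample response values are assumptions for illustration only; the body would be sent to the index's term vectors endpoint, whose exact path depends on the ES version. Joining the terms with spaces assumes the field's analyzer splits on whitespace and leaves the terms otherwise unchanged.

```python
import json


def build_termvectors_body(field, terms):
    """Build a term vectors request body for an artificial document.

    Hypothetical helper: the artificial document has a single field whose
    value is the terms of interest joined by spaces.
    """
    return {
        "doc": {field: " ".join(terms)},
        "term_statistics": True,    # ask for doc_freq and ttf per term
        "field_statistics": False,
        "positions": False,
        "offsets": False,
        "payloads": False,
    }


def extract_ttf(response, field):
    """Map each returned term to its 'ttf', using 0 when it is absent."""
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    return {term: stats.get("ttf", 0) for term, stats in terms.items()}


print(json.dumps(build_termvectors_body("text", ["lucene", "search"]), indent=2))

# Illustrative response shape with made-up numbers, not real output.
sample_response = {
    "term_vectors": {
        "text": {
            "terms": {
                "lucene": {"doc_freq": 2, "ttf": 5, "term_freq": 1},
                "search": {"term_freq": 1},
            }
        }
    }
}
print(extract_ttf(sample_response, "text"))  # {'lucene': 5, 'search': 0}
```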

Still, if there's interest in having a terms statistics API in ES, I'd be happy to take on the challenge of adding it.


(Shai Erera) #5

Wanted to give another update: I missed the part in the docs that says that for an artificial document a random shard is chosen to service the request. So it looks like I need to query all of the index's shards, then sum the 'ttf' of each term across them.
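
The summing step could look roughly like the sketch below. It assumes one term vectors response has already been collected per shard (how each shard is targeted, e.g. via routing, is left out), and the function name, the field name `text` and the sample numbers are all hypothetical.

```python
from collections import Counter


def sum_ttf_across_shards(shard_responses, field):
    """Add up the per-shard 'ttf' of every term in `field`.

    Hypothetical helper: `shard_responses` holds one term vectors response
    per shard; a term missing from a shard contributes 0 for that shard.
    """
    totals = Counter()
    for response in shard_responses:
        terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
        for term, stats in terms.items():
            totals[term] += stats.get("ttf", 0)
    return dict(totals)


# Made-up responses from two shards, for illustration only.
shard_0 = {"term_vectors": {"text": {"terms": {"lucene": {"ttf": 5}, "search": {"ttf": 2}}}}}
shard_1 = {"term_vectors": {"text": {"terms": {"lucene": {"ttf": 3}, "search": {}}}}}

print(sum_ttf_across_shards([shard_0, shard_1], "text"))  # {'lucene': 8, 'search': 2}
```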

