I couldn't find a similar term-level API, i.e. one that exposes the term-level stats Lucene already records (TermsEnum). Is there one that I'm missing? If not, is there another way to retrieve them?
I found this project on GitHub: https://github.com/jprante/elasticsearch-index-termlist. However, I'm currently required to use ES without plug-ins, so I was hoping that either I'm missing the API or there's a way to retrieve the stats via another API (I was thinking maybe aggregations, but I'm not yet sure that will work; I'm trying it in parallel to posting here).
If there isn't such an API, is it possible to add it? From a complexity standpoint, it should be nearly identical to the field-stats API, since all the stats are already recorded in Lucene.
The statistic I'm interested in is TermsEnum.totalTermFreq(), which returns the total number of occurrences of a term across all documents.
I don't think aggregations can help me here, at least not the ones I tried. E.g. when I tried the Terms aggregation, it returned the count of documents associated with each term. I also tried using a query to restrict the matching docs to the term(s) I'm interested in, but that gives me a different statistic (TermsEnum.docFreq()). In any case, I assume using aggregations for this task is heavy.
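To make the difference between the two statistics concrete, here is a small toy sketch in plain Python. It does not touch ES or Lucene; the `term_stats` helper, the whitespace tokenization, and the sample documents are all made up for illustration. It only mimics what docFreq() (number of documents containing a term) and totalTermFreq() (number of occurrences across all documents) would count.

```python
from collections import Counter

def term_stats(docs):
    """Return (doc_freq, total_term_freq) for whitespace-tokenized documents.

    Hypothetical helper for illustration only; real analysis in ES/Lucene
    depends on the field's analyzer.
    """
    doc_freq = Counter()
    total_term_freq = Counter()
    for text in docs:
        tokens = text.lower().split()
        total_term_freq.update(tokens)   # every occurrence counts
        doc_freq.update(set(tokens))     # each document counts once per term
    return doc_freq, total_term_freq

docs = ["to be or not to be", "to do", "be"]
df, ttf = term_stats(docs)
print("to:", df["to"], ttf["to"])  # doc count vs. occurrence count
```

For the term "to", the document count is 2 while the occurrence count is 3, which is the gap between what the Terms aggregation reported and the statistic I'm after.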
Is there another aggregation type that I can use to count the total term occurrences?
If I were to add this API to ES code, what's the best way to go about it? Do I need to fork the GitHub project and then submit a PR, or is it possible to push such changes into a branch in the ES repo?
Is it something that ES is interested in having at all?
This is a really cool feature, I must say :). Anyway, it allows me to pass the terms whose statistics I'm interested in as an artificial document. I've tried it on a sample index I created and it worked.
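For anyone following along, this is roughly the request body I mean, as a sketch. It assumes the feature in question is the Term Vectors API with an artificial document supplied under "doc". The `build_termvectors_body` helper and the field name "text" are hypothetical, and the exact endpoint path differs between ES versions, so no HTTP call is shown here.

```python
import json

def build_termvectors_body(field, terms):
    """Build a term vectors request body for an artificial document.

    Assumption: the terms are joined into one string and analyzed by the
    field's analyzer, so the terms returned may differ from the input.
    """
    return {
        "doc": {field: " ".join(terms)},  # the artificial document
        "fields": [field],
        "term_statistics": True,          # ask for doc_freq and ttf per term
        "field_statistics": False,
        "positions": False,
        "offsets": False,
        "payloads": False,
    }

body = build_termvectors_body("text", ["lucene", "elasticsearch"])
print(json.dumps(body, indent=2))
```

The per-term "ttf" value in the response is the total term frequency as seen by the shard that served the request.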
Still, if there's an interest to have a terms statistics API in ES, I think I'll be happy to take on the challenge to add it.
Wanted to give another update: I missed the part in the docs that says that for an artificial document a random shard is chosen to service the request. So it looks like I need to query all of the index's shards, then sum the 'ttf' values for each term across the shards.
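A minimal sketch of the summing step, assuming one term vectors response has already been collected per shard and that each response has the usual term_vectors -> field -> terms -> term -> ttf layout. The `sum_ttf` helper, the field name "text", and the sample numbers are made up for illustration. How each shard is targeted (for example with a routing or preference parameter) is left out, since that depends on the ES version.

```python
from collections import Counter

def sum_ttf(shard_responses, field):
    """Sum per-term 'ttf' values over a list of per-shard responses."""
    totals = Counter()
    for resp in shard_responses:
        terms = resp.get("term_vectors", {}).get(field, {}).get("terms", {})
        for term, stats in terms.items():
            # A term missing from a shard simply contributes nothing.
            totals[term] += stats.get("ttf", 0)
    return totals

# Made-up responses from two shards, trimmed to the relevant parts.
shard_responses = [
    {"term_vectors": {"text": {"terms": {"lucene": {"doc_freq": 2, "ttf": 5}}}}},
    {"term_vectors": {"text": {"terms": {"lucene": {"doc_freq": 1, "ttf": 3},
                                         "shard": {"doc_freq": 1, "ttf": 1}}}}},
]
totals = sum_ttf(shard_responses, "text")
print(dict(totals))
```

Note that these per-shard statistics do not account for deleted documents, so the sum is only as accurate as the underlying shard-level numbers.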