I am trying to get the total number of tokens in documents that match a query. I haven't defined any custom mapping and the field for which I want to get the token count is of type 'string'.
I tried a query along the following lines, but it gives a very large number on the order of 10^20, which is not the correct answer for my dataset.
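The aggregation was roughly like this, with `my_index` and `body` standing in for my actual index and field, and the script leaning on the `_index` term-statistics API:

```json
GET my_index/_search
{
  "size": 0,
  "query": { "match": { "body": "search terms" } },
  "aggs": {
    "total_tokens": {
      "sum": {
        "script": "_index['body'].sumttf()"
      }
    }
  }
}
```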
Reading this [1], it looks like you are repeatedly (i.e. once for every doc in the index) adding an index-level stat (`sumttf`), so the same index-wide total is being summed once per matching document.
What I suspect you might want instead is to sum the number of tokens in each doc that matches the query, which could be accessed via a script like this:
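Something along these lines, where `my_index` and `body` are placeholders for your index and field; `doc['body'].values` pulls the document's terms out of fielddata:

```json
GET my_index/_search
{
  "size": 0,
  "query": { "match": { "body": "search terms" } },
  "aggs": {
    "total_tokens": {
      "sum": {
        "script": "doc['body'].values.size()"
      }
    }
  }
}
```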
and it seems to work. But for a large string field it throws a CircuitBreakerException due to the size of the data loaded into memory. Is there an efficient way to do this aggregation?
Good old circuit breaker. I meant to add a warning that this would be inefficient on large indices/fields.
The most efficient alternative is to have an indexed field that stores the number of tokens in each document, which can then be accessed and summed by script-free aggregations.
This shifts the computation costs from query time to index time. With some special Analyzer configuration this could potentially be achieved using a custom TokenFilter that emits a single token that represents how many tokens (produced by earlier Tokenizers in the chain) were part of the document.
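One built-in way to get such a field is the `token_count` type, which counts the tokens the configured analyzer produces and indexes that number as an integer. A sketch for a pre-5.x index (index, type and field names are placeholders for yours):

```json
PUT my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string",
          "fields": {
            "num_tokens": {
              "type": "token_count",
              "analyzer": "standard"
            }
          }
        }
      }
    }
  }
}
```

Existing documents would need to be reindexed to populate the count, but the sum then becomes a plain script-free aggregation:

```json
GET my_index/_search
{
  "size": 0,
  "query": { "match": { "body": "search terms" } },
  "aggs": {
    "total_tokens": { "sum": { "field": "body.num_tokens" } }
  }
}
```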
"Norms" [1] are a rougher Lucene measure of field length but I'm not sure these low-level per-doc values are accessible as part of the aggregation logic.