Total token count in all documents


I am trying to get the total number of tokens in documents that match a query. I haven't defined any custom mapping and the field for which I want to get the token count is of type 'string'.

I tried the following query, but it returns a very large number, on the order of 10^20, which is not the correct answer for my dataset.

curl -XPOST 'localhost:9200/nodename/comment/_search?pretty' -d '
{ "query": { "match_all": {} }, "aggs": { "tk_count": { "sum": { "script": "_index[\"body\"].sumttf()" } } }, "size": 0 }'

Any idea how to get the correct count of all tokens? (I do not need counts for each term, only the total count.)

(Mark Harwood) #2

Reading this [1], it looks like you are repeatedly (i.e. once for every doc in the index) adding an index-level stat (sumttf).
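To make the inflation concrete, here is a toy sketch in plain Python (not Elasticsearch code). It assumes a single shard and made-up per-document token counts, and only mimics what a sum aggregation does when its script returns the same index-level value for every document:

```python
# Hypothetical per-document token counts for the "body" field (single shard).
tokens_per_doc = [120, 80, 200]

# Index-level statistic: total term frequency summed over the whole field.
sum_ttf = sum(tokens_per_doc)

# A sum aggregation evaluates its script once per matching document,
# so an index-level value gets added once per document.
inflated_total = sum(sum_ttf for _ in tokens_per_doc)

# Summing a per-document length instead gives the intended total.
correct_total = sum(tokens_per_doc)

print(inflated_total, correct_total)  # 1200 400
```

With N matching documents the first figure is N times the second, which is why the result grows so quickly on a real index.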
What I suspect you want instead is to sum the number of tokens in each doc that matches the query, which could be accessed via this script:




Thanks, that seems to be the case with sumttf().

I tried

and it seems to work. But for a large string field it throws a CircuitBreaker exception due to the large data size. Is there an efficient way to do this aggregation?


(Mark Harwood) #4

Good old circuit breaker. I meant to add a warning that this would be inefficient on large indices/fields. :)

The most efficient alternative is to have an indexed field that stores the number of tokens in each document, which can then be accessed and summed by script-free aggregations.
This shifts the computation cost from query time to index time. With some special Analyzer configuration this could potentially be achieved using a custom TokenFilter that emits a single token representing how many tokens (produced by the earlier stages of the analysis chain) were part of the document.
"Norms" [1] are a rougher Lucene measure of field length but I'm not sure these low-level per-doc values are accessible as part of the aggregation logic.

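A minimal sketch of the index-time approach, written in Python only to build the JSON bodies. The field name `body_token_count` is hypothetical, and counting tokens by whitespace splitting is an assumption that will not match a real analyzer exactly:

```python
import json


def with_token_count(doc, field="body", count_field="body_token_count"):
    """Return a copy of doc with a precomputed token count added.

    count_field is a hypothetical name. Whitespace splitting is only an
    approximation of what the field's analyzer would produce.
    """
    enriched = dict(doc)
    enriched[count_field] = len(doc.get(field, "").split())
    return enriched


def sum_tokens_request(count_field="body_token_count"):
    """Build a script-free search body that sums the stored counts."""
    return {
        "query": {"match_all": {}},
        "size": 0,
        "aggs": {"tk_count": {"sum": {"field": count_field}}},
    }


doc = with_token_count({"body": "the quick brown fox"})
print(json.dumps(doc))
print(json.dumps(sum_tokens_request(), indent=2))
```

The second body can be sent to the same `_search` endpoint as the original query. Because the sum reads a numeric field instead of running a script over the field's terms, the per-document work is done once, when the document is indexed.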



Thanks Mark! Looks like I need to reindex my data so that I can run these queries efficiently.


(Nik Everett) #6


Thanks Nik! The "token_count" datatype will be useful for my purpose.
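For anyone landing here later, this is roughly what I expect the setup to look like, sketched in Python just to print the JSON bodies. The sub-field name `length` and the choice of the `standard` analyzer are my assumptions. The parent field keeps the `string` type from my index (newer Elasticsearch versions use `text` instead):

```python
import json

# Mapping for the "comment" type: "body" gets a token_count sub-field.
# "length" is an illustrative sub-field name; token_count needs an analyzer.
mapping = {
    "mappings": {
        "comment": {
            "properties": {
                "body": {
                    "type": "string",
                    "fields": {
                        "length": {
                            "type": "token_count",
                            "analyzer": "standard",
                        }
                    },
                }
            }
        }
    }
}

# Search body: sum the per-document token counts without any script.
search = {
    "query": {"match_all": {}},
    "size": 0,
    "aggs": {"tk_count": {"sum": {"field": "body.length"}}},
}

print(json.dumps(mapping, indent=2))
print(json.dumps(search, indent=2))
```

The mapping has to be in place before the documents are indexed, so existing data still needs a reindex.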
