Total tokens count in all documents

tyka · February 25, 2016, 8:31pm

I am trying to get the total number of tokens in documents that match a query. I haven't defined any custom mapping and the field for which I want to get the token count is of type 'string'.

I tried the following query, but it returns a very large number, on the order of 10^20, which cannot be the correct answer for my dataset.

curl -XPOST 'localhost:9200/nodename/comment/_search?pretty' -d '
{
  "query": { "match_all": {} },
  "aggs": { "tk_count": { "sum": { "script": "_index[\"body\"].sumttf()" } } },
  "size": 0
}'

Any idea how to get the correct count of all tokens? (I do not need counts for each term, only the total count.)

Mark_Harwood · February 26, 2016, 3:55pm

Reading this [1], it looks like you are repeatedly (i.e. for every doc in the index) adding an index-level stat (sumttf).
What I suspect you want instead is to sum the number of tokens in each doc that matches the query, which can be accessed via this script:

doc["body"].values.size()

[1] Text scoring in scripts | Elasticsearch Guide [2.2] | Elastic
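
To make the overcount concrete, here is a small sketch in plain Python (no Elasticsearch involved). The toy documents and the whitespace tokenization are assumptions for illustration only; the real analyzer and the per-shard behaviour of sumttf are not modelled.

```python
# Toy corpus standing in for the "body" field. Whitespace splitting is an
# assumption here, not the analyzer Elasticsearch would actually apply.
docs = ["the quick brown fox", "jumps over", "the lazy dog"]

tokens_per_doc = [len(d.split()) for d in docs]

# Index-level statistic: total number of tokens across the whole field.
sum_ttf = sum(tokens_per_doc)

# What the original aggregation effectively computes: the same index-level
# value added once for every matching document.
inflated = sum(sum_ttf for _ in docs)

# What was intended: one per-document count added per matching document.
intended = sum(tokens_per_doc)

print(sum_ttf, inflated, intended)  # prints: 9 27 9
```

With N matching documents the inflated figure is N times the index-level total, which would explain a result far larger than the dataset can justify.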

tyka · February 26, 2016, 4:13pm

Thanks, that seems to be the case with sumttf().

I tried

and it seems to work. But for a large string field it throws a CircuitBreaker exception because of the large data size. Is there an efficient way to do this aggregation?

Thanks!

Mark_Harwood · February 26, 2016, 4:48pm

Good old circuit breaker. I meant to add a warning that this would be inefficient on large indices/fields.

The most efficient alternative is to have an indexed field that stores the number of tokens in a document which can then be accessed and summed by script-free aggregations.
This shifts the computation costs from query time to index time. With some special Analyzer configuration this could potentially be achieved using a custom TokenFilter that emits a single token that represents how many tokens (produced by earlier Tokenizers in the chain) were part of the document.
"Norms" [1] are a rougher Lucene measure of field length but I'm not sure these low-level per-doc values are accessible as part of the aggregation logic.

Cheers
Mark

[1] https://www.elastic.co/guide/en/elasticsearch/reference/2.2/norms.html

tyka · February 26, 2016, 5:21pm

Thanks Mark! Looks like I need to reindex my data so that I can run these queries efficiently.

Thanks,
Tyka

nik9000 · February 26, 2016, 6:06pm

tyka · February 26, 2016, 8:46pm

Thanks Nik! The "token_count" datatype will be useful for my purpose.
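
For anyone landing here later, a rough sketch of what this could look like, written as Python dicts that print the JSON request bodies. The sub-field name "length" and the "standard" analyzer are assumptions chosen for illustration; the index and type names (nodename, comment) and the "string" field type are taken from the question above. Existing documents would need to be reindexed before the new sub-field is populated.

```python
import json

# Mapping body for creating the index (e.g. PUT localhost:9200/nodename).
# "length" is a hypothetical sub-field name; "standard" is an assumed analyzer.
mapping = {
    "mappings": {
        "comment": {
            "properties": {
                "body": {
                    "type": "string",
                    "fields": {
                        "length": {
                            "type": "token_count",
                            "analyzer": "standard",
                        }
                    },
                }
            }
        }
    }
}

# Search body (e.g. POST localhost:9200/nodename/comment/_search).
# A plain sum over the precomputed per-document counts, with no script.
search = {
    "query": {"match_all": {}},
    "size": 0,
    "aggs": {"tk_count": {"sum": {"field": "body.length"}}},
}

print(json.dumps(mapping, indent=2))
print(json.dumps(search, indent=2))
```

Because the count is computed at index time and stored as a number, the aggregation should not need to load the analyzed text of the field, which is what triggered the circuit breaker earlier in the thread.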
