Total tokens count in all documents


#1

I am trying to get the total number of tokens in documents that match a query. I haven't defined any custom mapping and the field for which I want to get the token count is of type 'string'.

I tried the following query, but it returns a very large number, on the order of 10^20, which cannot be the correct answer for my dataset.

curl -XPOST 'localhost:9200/nodename/comment/_search?pretty' -d '
{
  "query": { "match_all": {} },
  "aggs": { "tk_count": { "sum": { "script": "_index[\"body\"].sumttf()" } } },
  "size": 0
}'

Any idea how to get the correct count of all tokens? (I do not need counts for each term, only the total count.)


(Mark Harwood) #2

Reading this [1], it looks like you are repeatedly (i.e. for every doc in the index) adding an index-level stat (sumttf).
What I suspect you want instead is to sum the number of tokens in each doc that matches the query, which could be accessed via this script:

doc["body"].values.size()

[1] https://www.elastic.co/guide/en/elasticsearch/reference/2.2/modules-advanced-scripting.html#_field_statistics_3
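
For anyone following along, here is a rough sketch of how the script above might slot into the same sum aggregation as the original query. It only builds and prints the request body; nothing is sent to the cluster. The helper name per_doc_count_body is made up for illustration, and the string form of "script" simply mirrors the query in the first post.

```python
import json

# Hypothetical helper (not part of any Elasticsearch client): builds the
# search body only; sending it is left to curl or an HTTP client of choice.
def per_doc_count_body(field="body"):
    # Per-document script from the reply above; it is evaluated once for
    # each matching doc and the results are summed by the aggregation.
    script = 'doc["%s"].values.size()' % field
    return {
        "query": {"match_all": {}},
        "aggs": {"tk_count": {"sum": {"script": script}}},
        "size": 0,  # only the aggregation result is wanted, no hits
    }

body = per_doc_count_body()
print(json.dumps(body, indent=2))
```

The printed JSON can be posted to localhost:9200/nodename/comment/_search in place of the original body.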


#3

Thanks, that seems to be the case with sumttf().

I tried the script you suggested and it seems to work. But for a large string field it throws a CircuitBreaker exception due to the large data size. Is there an efficient way to do this aggregation?

Thanks!


(Mark Harwood) #4

Good old circuit breaker. I meant to add a warning that this would be inefficient on large indices/fields. :)

The most efficient alternative is to have an indexed field that stores the number of tokens in a document which can then be accessed and summed by script-free aggregations.
This shifts the computation costs from query time to index time. With some special Analyzer configuration this could potentially be achieved using a custom TokenFilter that emits a single token that represents how many tokens (produced by earlier Tokenizers in the chain) were part of the document.
"Norms" [1] are a rougher Lucene measure of field length but I'm not sure these low-level per-doc values are accessible as part of the aggregation logic.

Cheers
Mark

[1] https://www.elastic.co/guide/en/elasticsearch/reference/2.2/norms.html


#5

Thanks Mark! Looks like I need to reindex my data so that I can run these queries efficiently.

Thanks,
Tyka


(Nik Everett) #6

https://www.elastic.co/guide/en/elasticsearch/reference/2.2/token-count.html
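
A minimal sketch of how the token_count datatype linked above might be wired up for this case, assuming a multi-field on "body" and a plain, script-free sum aggregation over it. The sub-field name "length" and the choice of the "standard" analyzer are assumptions for illustration, as are the two helper function names; the type name "comment" is taken from the original query. The code only builds the JSON bodies and does not talk to a cluster.

```python
import json

# Hypothetical helper: index-creation body with a token_count sub-field
# that stores the number of tokens in `field` at index time.
def token_count_mapping(doc_type="comment", field="body", sub_field="length"):
    return {
        "mappings": {
            doc_type: {
                "properties": {
                    field: {
                        "type": "string",
                        "fields": {
                            # "length" and "standard" are illustrative choices
                            sub_field: {"type": "token_count", "analyzer": "standard"}
                        },
                    }
                }
            }
        }
    }

# Hypothetical helper: script-free sum over the stored per-doc counts.
def total_tokens_body(field="body", sub_field="length"):
    return {
        "query": {"match_all": {}},
        "aggs": {"tk_count": {"sum": {"field": field + "." + sub_field}}},
        "size": 0,
    }

mapping = token_count_mapping()
search = total_tokens_body()
print(json.dumps(mapping, indent=2))
print(json.dumps(search, indent=2))
```

The first body would be used when creating the index and the second posted to its _search endpoint. Documents indexed before the sub-field existed would need to be reindexed for their counts to be included.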


#7

Thanks Nik! The "token_count" datatype will be useful for my purpose.

