How to get document length in terms of total terms/tokens - not in bytes?

I'm trying to find a way to get the total length of a document in terms of tokens.

Right now I can achieve this by using the termvectors API and summing the term_freq of every term in the document, but I wonder if there's a more straightforward way?

I'd also like to get the average document length, again in number of tokens. I can do that by iterating over all documents, but I wonder if there's an easier way?


Some other pointers that may help with your problem:

  1. You can use the explain=true parameter in your query to get more details on how scores were calculated. As part of these details, you can obtain the average length of the queried field in this index:

For example:

GET my_index/_search?explain=true
{
  "query": {
    "term": {"my_field": "fox"}
  }
}

produces a response that includes:

"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
...
{
  "value": 6.0,
  "description": "avgFieldLength",
  "details": []
}

  2. Another way is to add a field of the special type token_count, which counts the tokens of the specified field for every document. To get the total number of tokens across all documents, you can then run a sum aggregation on this token_count field; for the average value, use an avg aggregation.
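
The two pointers above can be sketched in Python as plain dictionary handling, without assuming any particular client library. The helper names (find_explain_value, token_count_mapping, length_aggs_body), the subfield name "length", and the "standard" analyzer are hypothetical choices for illustration only. The mapping shape also assumes a version without mapping types; older versions nest "properties" under a type name.

```python
from typing import Optional


def find_explain_value(explanation: dict, description: str) -> Optional[float]:
    """Depth-first search of an explain tree for a node with this exact description."""
    if explanation.get("description") == description:
        return explanation.get("value")
    for child in explanation.get("details", []):
        found = find_explain_value(child, description)
        if found is not None:
            return found
    return None


def token_count_mapping(field: str, analyzer: str = "standard") -> dict:
    """Mapping body adding a token_count subfield `<field>.length` (name is an assumption)."""
    return {
        "properties": {
            field: {
                "type": "text",
                "fields": {
                    "length": {"type": "token_count", "analyzer": analyzer}
                },
            }
        }
    }


def length_aggs_body(field: str) -> dict:
    """Search body with sum and avg aggregations over the token_count subfield."""
    return {
        "size": 0,  # only the aggregation results are needed, not the hits
        "aggs": {
            "total_tokens": {"sum": {"field": f"{field}.length"}},
            "avg_tokens": {"avg": {"field": f"{field}.length"}},
        },
    }
```

With explain=true, the explanation tree should be available under each hit's _explanation key, so something like find_explain_value(hit["_explanation"], "avgFieldLength") would pull out the value shown in the response above. Note that the exact description strings differ between versions, so the match may need adjusting.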

Thanks. I indeed used #2 to get the average document length. It just seems a bit odd that this is not provided as part of the return value of the termvectors API.
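
As a side note on the termvectors route, here is a minimal sketch. It assumes the response has the usual term_vectors -> field -> terms shape with a term_freq per term, and that field_statistics (sum_ttf, doc_count) is included in the response. The helper names and the sample response are hypothetical. As far as I know, these field statistics are computed on the shard holding the document and do not account for deleted documents, so the average is an approximation at best.

```python
def doc_length(termvectors_response: dict, field: str) -> int:
    """Number of tokens in one document's field: the sum of term_freq over its terms."""
    terms = termvectors_response["term_vectors"][field]["terms"]
    return sum(info["term_freq"] for info in terms.values())


def avg_field_length(termvectors_response: dict, field: str) -> float:
    """Approximate average field length: sum_ttf divided by doc_count."""
    stats = termvectors_response["term_vectors"][field]["field_statistics"]
    return stats["sum_ttf"] / stats["doc_count"]


# Hypothetical, heavily trimmed response for illustration only.
sample = {
    "term_vectors": {
        "my_field": {
            "field_statistics": {"sum_doc_freq": 10, "doc_count": 2, "sum_ttf": 12},
            "terms": {
                "quick": {"term_freq": 1},
                "fox": {"term_freq": 2},
            },
        }
    }
}

print(doc_length(sample, "my_field"))        # 3
print(avg_field_length(sample, "my_field"))  # 6.0
```

This still needs one request per document for the length, so the token_count field remains the simpler option when totals or averages over many documents are needed.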