How to get document length in terms of total terms/tokens - not in bytes?

I'm trying to find a way to get the total length of a document in terms of tokens.

Right now I can get this from the termvectors API by summing the per-term frequencies, but I wonder if there's a more straightforward way?
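
For reference, a termvectors request along those lines might look like this (a sketch for recent Elasticsearch versions, assuming an index my_index, a text field my_field, and a document with id 1):

GET my_index/_termvectors/1
{
  "fields": ["my_field"],
  "positions": false,
  "offsets": false,
  "field_statistics": false
}

Summing the term_freq values under term_vectors.my_field.terms in the response gives the token count of that document.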

I'd also like to get the average document length, again in number of tokens. I could do that by iterating over all documents, but is there an easier way?

Thanks

Some other pointers that may help you with your problem:

  1. You can use the explain=true parameter in your query to get more details on how scores were calculated. As part of these details, you can obtain the average document length for the index field:

For example:

GET my_index/_search?explain=true
{
  "query": {
    "term": { "my_field": "fox" }
  }
}

produces a response containing:

...
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
...
{
  "value": 6.0,
  "description": "avgFieldLength",
  "details": []
},
  2. Another way is to add a field of type token_count, which stores the token count of the specified field for every document. To get the total number of tokens across all documents, you can then run a sum aggregation on this token_count field; for the average value, an avg aggregation (see the sketch below).
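
For illustration, the mapping and the two aggregations might look roughly like this (a sketch for recent Elasticsearch versions; the index name my_index, field name my_field, and sub-field name tokens are only placeholders):

PUT my_index
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "fields": {
          "tokens": {
            "type": "token_count",
            "analyzer": "standard"
          }
        }
      }
    }
  }
}

GET my_index/_search
{
  "size": 0,
  "aggs": {
    "total_tokens": { "sum": { "field": "my_field.tokens" } },
    "avg_tokens": { "avg": { "field": "my_field.tokens" } }
  }
}

Note that the token_count sub-field is only populated for documents indexed after it was added to the mapping, so existing documents would need to be reindexed.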

Thanks. I indeed used #2 to get the average document length. It just seems a bit odd that this is not provided as part of the termvectors API response.
