How to get document length in terms of total terms/tokens - not in bytes?

peidaqi · July 18, 2019, 1:14am

I'm trying to find a way to get the total length of a document in terms of tokens.

Right now I can achieve this by using the termvectors API and summing over all terms' ttf, but I wonder if there's a more straightforward way?

I'd also like to get the average document length, again in number of tokens. I can do that by iterating over all documents, but I wonder if there's an easier way?
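
For illustration, the summing step could be sketched as below. This is only a sketch under assumptions: it sums the per-document `term_freq` entries (ttf, by contrast, counts a term's occurrences across all documents), the response has already been parsed into a dict, and the sample data and the function name `document_length` are made up for the example.

```python
def document_length(termvectors_response, field):
    """Sum term_freq over every term the field has in this one document."""
    terms = termvectors_response["term_vectors"][field]["terms"]
    return sum(info["term_freq"] for info in terms.values())


# Hand-written sample in the shape of a parsed termvectors response
# (values are illustrative, not taken from a real index).
sample = {
    "term_vectors": {
        "my_field": {
            "terms": {
                "quick": {"term_freq": 1},
                "brown": {"term_freq": 1},
                "fox": {"term_freq": 2},
            }
        }
    }
}

print(document_length(sample, "my_field"))
```

The count reflects tokens after analysis, so anything the analyzer drops (stop words, for instance) is not included.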

Thanks

mayya · July 25, 2019, 8:30pm

Some other pointers that may help with your problem:

You can use the explain=true parameter in your query to get more details on how scores were calculated. As part of these details, you can obtain the average document length for the queried field in this index:

For example:

GET my_index/_search?explain=true
{
  "query": {
    "term": {"my_field": "fox"}
  }
}

produces a response containing:

...
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
...
{
	"value": 6.0,
	"description": "avgFieldLength",
	"details": []
},
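
If you want to pull that value out programmatically, a minimal sketch could walk the nested explanation tree. The function name `find_explain_value` and the sample tree are hypothetical; the label is passed in as a prefix because the exact description text may differ between versions.

```python
def find_explain_value(node, description_prefix):
    """Depth-first search of an explanation tree for the first node whose
    description starts with the given prefix; returns its value or None."""
    if node.get("description", "").startswith(description_prefix):
        return node["value"]
    for child in node.get("details", []):
        found = find_explain_value(child, description_prefix)
        if found is not None:
            return found
    return None


# Hand-written tree in the shape of an explanation (illustrative values only).
explanation = {
    "value": 0.5,
    "description": "weight(my_field:fox in 0), result of:",
    "details": [
        {
            "value": 0.9,
            "description": "tfNorm, computed as ... from:",
            "details": [
                {"value": 1.0, "description": "termFreq=1.0", "details": []},
                {"value": 6.0, "description": "avgFieldLength", "details": []},
            ],
        }
    ],
}

print(find_explain_value(explanation, "avgFieldLength"))
```

With a real search response, the tree to pass in would be the `_explanation` object of one of the hits.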

Another way is to add a field of the special type token_count, which stores the number of tokens in the specified field for every document. To get the total number of tokens across all documents, you can then run a sum aggregation on this token_count field; for the average, use an avg aggregation.
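
A rough sketch of what the two request bodies might look like, built here as Python dicts so they can be printed as JSON. The sub-field name `length`, the aggregation names, and the helper functions are made up for the example, and the mapping layout assumes a version with typeless mappings; older versions would need a mapping type name around `properties`.

```python
import json

FIELD = "my_field"
COUNT_SUBFIELD = "length"  # hypothetical sub-field name


def token_count_mapping(field=FIELD, subfield=COUNT_SUBFIELD, analyzer="standard"):
    """Index-creation body: a text field with a token_count sub-field."""
    return {
        "mappings": {
            "properties": {
                field: {
                    "type": "text",
                    "fields": {
                        subfield: {"type": "token_count", "analyzer": analyzer}
                    },
                }
            }
        }
    }


def token_stats_query(field=FIELD, subfield=COUNT_SUBFIELD):
    """Search body: total and average token count, returning no hits."""
    path = f"{field}.{subfield}"
    return {
        "size": 0,
        "aggs": {
            "total_tokens": {"sum": {"field": path}},
            "avg_tokens": {"avg": {"field": path}},
        },
    }


print(json.dumps(token_count_mapping(), indent=2))  # body for PUT my_index
print(json.dumps(token_stats_query(), indent=2))    # body for GET my_index/_search
```

The token_count sub-field is only populated for documents indexed after the mapping is in place, so existing data would need to be reindexed.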

peidaqi · July 30, 2019, 12:43am

Thanks. I did use #2 to get the average document length. It just seems a bit odd that this is not provided as part of the return value of the termvectors API.

