I'm trying to find a way to get the total length of a document in terms of tokens.
Right now I can do this with the termvectors API by summing the ttf of every term, but I wonder if there's a more straightforward way?
I'd also like to get the average document length, again in number of tokens. I can do that by running through all documents, but I wonder if there's an easier way?
Some pointers that may help with your problem:
You can add the explain=true parameter to your query to see in more detail how the scores were calculated. Among those details you can find the average document length for the queried field in this index.
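As a rough sketch of the explain approach, the Python below builds a search body with explain enabled and walks the `_explanation` tree of a hit looking for the average field length. The field name `body`, the helper names, and the description labels `avgFieldLength` and `avgdl` are assumptions for illustration; the exact label depends on the Elasticsearch version and similarity in use, so check your own explain output.

```python
import json


def build_explain_query(field, text):
    """Search body that asks Elasticsearch to explain each hit's score."""
    return {"explain": True, "query": {"match": {field: text}}}


def find_avg_field_length(explanation):
    """Depth-first search of an _explanation tree for the average field length.

    Each node is expected to carry "value", "description" and "details".
    """
    description = explanation.get("description", "")
    # Assumed labels: the exact wording varies between versions.
    if description.startswith(("avgFieldLength", "avgdl")):
        return explanation["value"]
    for child in explanation.get("details", []):
        found = find_avg_field_length(child)
        if found is not None:
            return found
    return None


if __name__ == "__main__":
    # Hypothetical field name and query text.
    print(json.dumps(build_explain_query("body", "some text"), indent=2))
```

You would send the body to the index's `_search` endpoint and pass `hit["_explanation"]` from any returned hit to `find_avg_field_length`. Note that the value is only present when the similarity in use (for example BM25) actually reports it.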
Another way is to add a field of the special type token_count, which stores the number of tokens in the specified field for every document. To get the total number of tokens across all documents, run a sum aggregation on this token_count field; for the average, use an avg aggregation.
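A minimal sketch of the token_count approach, again in Python, showing the request bodies involved. The field name `body`, the sub-field name `length`, the `standard` analyzer and the helper names are assumptions for illustration, and the mapping uses the typeless layout of recent Elasticsearch versions (older versions nest `properties` under a type name).

```python
def token_count_mapping(field, analyzer="standard"):
    """Index body that adds a token_count sub-field named 'length' to a text field."""
    return {
        "mappings": {
            "properties": {
                field: {
                    "type": "text",
                    "fields": {
                        # token_count needs to know which analyzer to count with
                        "length": {"type": "token_count", "analyzer": analyzer}
                    },
                }
            }
        }
    }


def token_stats_query(field):
    """Search body with sum and avg aggregations over the token_count sub-field."""
    counted = field + ".length"
    return {
        "size": 0,  # only the aggregation results are needed, not the hits
        "aggs": {
            "total_tokens": {"sum": {"field": counted}},
            "avg_tokens": {"avg": {"field": counted}},
        },
    }


def read_token_stats(response):
    """Return (total, average) token counts from a search response."""
    aggs = response["aggregations"]
    return aggs["total_tokens"]["value"], aggs["avg_tokens"]["value"]
```

The mapping has to be in place before documents are indexed, since the count is computed at index time; existing documents would need to be reindexed to get a value.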