How to calculate Inverse Document Frequency for particular term?

kartheek91 · December 10, 2015, 5:38am

PUT /my_index/doc/1
{ "text" : "quick brown fox" }

GET /my_index/doc/_search?explain
{
  "query": {
    "term": {
      "text": "fox"
    }
  }
}

weight(text:fox in 0) [PerFieldSimilarity]: 0.15342641
  result of:
    fieldWeight in 0: 0.15342641
      product of:
        tf(freq=1.0), with freq of 1: 1.0
        idf(docFreq=1, maxDocs=1): 0.30685282
        fieldNorm(doc=0): 0.5

Can anyone explain how these values are derived, so that I can reproduce them by manual calculation?

cbuescher · December 10, 2015, 1:27pm

Hi,

tf(freq=1.0), with freq of 1:        1.0

This is the frequency of your search term in the matched doc. ClassicSimilarity takes the square root of that frequency, so a single occurrence gives 1.0.

idf(docFreq=1, maxDocs=1):           0.30685282

The ClassicSimilarity in Lucene calculates this as log(numDocs/(docFreq+1)) + 1, where log is the natural logarithm. If you fill in the values you get log(1/(1+1)) + 1 = 0.30685282.
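To check the idf value by hand, here is a minimal Python sketch of the formula quoted above. The function name `classic_idf` is made up for illustration and is not a Lucene or Elasticsearch API; it only assumes the natural logarithm.

```python
import math

def classic_idf(doc_freq: int, num_docs: int) -> float:
    """Hypothetical helper: log(numDocs / (docFreq + 1)) + 1, natural log."""
    return math.log(num_docs / (doc_freq + 1)) + 1.0

# Values from the explain output: docFreq=1, maxDocs=1
idf = classic_idf(doc_freq=1, num_docs=1)
print(idf)
```

Lucene works in 32-bit floats, so the printed double agrees with 0.30685282 only to about seven significant digits.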

fieldNorm(doc=0):                    0.5

Here it gets more complicated and harder to track by hand. Essentially, the normalization factor for a document field should lower the score for documents with long fields. Lucene calculates this at index time; the formula is roughly 1.0 / Math.sqrt(numTerms) according to ClassicSimilarity#lengthNorm, so for three terms like in the example you would get ~0.577. However, this value is stored as a single byte and later converted back to a float, so precision is lost, as e.g. described here. That is why the explain output shows 0.5 instead of 0.577.
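The rounding can be reproduced with a short Python sketch. The `encode_norm` and `decode_norm` helpers are hypothetical names; their bit manipulation is modeled on my reading of Lucene's SmallFloat.floatToByte315 and byte315ToFloat, so treat those details as an assumption and not as the authoritative implementation. The last lines multiply the three factors from the explain output.

```python
import math
import struct

def float_to_bits(f: float) -> int:
    # Round to a 32-bit float and reinterpret the bytes as a signed int
    return struct.unpack(">i", struct.pack(">f", f))[0]

def bits_to_float(bits: int) -> float:
    return struct.unpack(">f", struct.pack(">i", bits))[0]

def encode_norm(f: float) -> int:
    """Assumed equivalent of SmallFloat.floatToByte315 (returns 0..255)."""
    bits = float_to_bits(f)
    small = bits >> 21              # drop the low 21 bits of the float
    floor = (63 - 15) << 3          # smallest representable exponent, shifted
    if small <= floor:
        return 0 if bits <= 0 else 1
    if small >= floor + 0x100:
        return 255                  # overflow: largest byte value
    return small - floor

def decode_norm(b: int) -> float:
    """Assumed equivalent of SmallFloat.byte315ToFloat."""
    if b == 0:
        return 0.0
    return bits_to_float(((b & 0xFF) << 21) + ((63 - 15) << 24))

# "quick brown fox" has three terms
length_norm = 1.0 / math.sqrt(3)            # ~0.577
field_norm = decode_norm(encode_norm(length_norm))

tf = math.sqrt(1.0)                         # one occurrence of "fox"
idf = math.log(1 / (1 + 1)) + 1.0           # docFreq=1, numDocs=1
score = tf * idf * field_norm
print(length_norm, field_norm, score)
```

Under these assumptions the encoded byte keeps only a couple of mantissa bits, so 0.577 falls back to 0.5, and the product of the three factors matches the 0.15342641 in the explain output to float precision.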

If you care to dive deeper into scoring, there are lots of general resources describing TF/IDF (sometimes with slightly different implementation details). The description of Lucene's TFIDF similarity looks complicated but is worth taking a look at. For anything other than simple examples like the one you gave, calculating scores by hand is a complex task.
