How to calculate Inverse Document Frequency for a particular term?


(Kartheek Gummaluri) #1

PUT /my_index/doc/1
{ "text" : "quick brown fox" }

GET /my_index/doc/_search?explain
{
  "query": {
    "term": {
      "text": "fox"
    }
  }
}

weight(text:fox in 0) [PerFieldSimilarity]: 0.15342641
result of:
  fieldWeight in 0 0.15342641
  product of:
    tf(freq=1.0), with freq of 1: 1.0
    idf(docFreq=1, maxDocs=1): 0.30685282
    fieldNorm(doc=0): 0.5

Can anyone explain how these values can be derived by manual calculation?


(Christoph) #2

Hi,

tf(freq=1.0), with freq of 1:        1.0

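This is the term-frequency factor for your search term in the matched doc. ClassicSimilarity computes it as the square root of the raw frequency, so a term that occurs once gives sqrt(1) = 1.0.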

idf(docFreq=1, maxDocs=1):           0.30685282

The ClassicSimilarity in Lucene calculates this as log(numDocs/(docFreq+1)) + 1, where log is the natural logarithm. If you fill in the values you get log(1/(1+1)) + 1 = 1 - 0.6931 ≈ 0.3069.
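To check that number by hand, here is a minimal Python sketch of the formula above. The function name classic_idf is made up for illustration and is not part of Lucene or Elasticsearch. Lucene works in 32-bit floats, so the last digits can differ slightly from Python's doubles.

```python
import math

def classic_idf(doc_freq: int, num_docs: int) -> float:
    """idf = ln(numDocs / (docFreq + 1)) + 1, as quoted above (hypothetical helper)."""
    return math.log(num_docs / (doc_freq + 1)) + 1.0

# Values from the explain output: docFreq=1, maxDocs=1
idf = classic_idf(doc_freq=1, num_docs=1)
print(round(idf, 8))  # 0.30685282
```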

fieldNorm(doc=0):                    0.5 

Here it gets more complicated and hard to track by hand. Essentially, the normalization factor for a document field should lower the score for documents with long fields. Lucene calculates this at index time; the formula is roughly 1.0 / Math.sqrt(numTerms) according to ClassicSimilarity#lengthNorm, so for three terms like in the example you would get ~0.577. However, this value is stored as a single byte and later converted back to a float, so precision is lost along the way, which is why the explain output shows 0.5 rather than 0.577. These rounding issues are described here, for example.
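If you want to reproduce the 0.5, the following Python sketch imitates the one-byte round trip. It is modeled on my recollection of Lucene's SmallFloat.floatToByte315 and byte315ToFloat, so treat the bit layout as an assumption and check it against the Lucene source for your version. The Python function names are hypothetical, and the field boost is assumed to be 1.0.

```python
import math
import struct

FZERO = (63 - 15) << 3  # offset of the smallest exponent the byte can hold


def float_to_byte315(f: float) -> int:
    """Keep only the top bits of a float32 and squeeze them into one byte."""
    bits = struct.unpack(">i", struct.pack(">f", f))[0]  # raw float32 bits
    small = bits >> 21  # drop the 21 lowest mantissa bits
    if small <= FZERO:
        return 0 if bits <= 0 else 1  # zero/negative -> 0, tiny positive -> 1
    if small >= FZERO + 0x100:
        return 255  # too large: saturate at the biggest byte value
    return small - FZERO


def byte315_to_float(b: int) -> float:
    """Expand the stored byte back into a float32 value."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << 21) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]


def stored_field_norm(num_terms: int) -> float:
    """Length norm 1/sqrt(numTerms) after the lossy byte round trip."""
    length_norm = 1.0 / math.sqrt(num_terms)
    return byte315_to_float(float_to_byte315(length_norm))


print(stored_field_norm(3))  # 0.5 for "quick brown fox"
```

The discarded mantissa bits are simply truncated, so the stored norm is never larger than the exact value; a two-term field, for instance, comes out as 0.625 instead of ~0.707.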

If you care to dive deeper into scoring, there are lots of general resources describing TF/IDF (sometimes with slightly different implementation details). The description of Lucene's TFIDFSimilarity looks complicated but is worth taking a look at. For anything other than simple examples like the one you gave, calculating scores by hand is a complex task.
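For the simple example in this thread, though, the final step is just the product shown in the explain output. A small Python check using the three values exactly as reported there (the variable names are only for illustration):

```python
# Factors copied from the explain output in the question
tf = 1.0            # tf(freq=1.0)
idf = 0.30685282    # idf(docFreq=1, maxDocs=1)
field_norm = 0.5    # fieldNorm(doc=0)

# fieldWeight is reported as the product of the three factors
field_weight = tf * idf * field_norm
print(round(field_weight, 8))  # 0.15342641
```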

