I have been analysing Elasticsearch results with explain: true, and I am not able to understand which technique is applied to calculate idf. I went through the Lucene scoring formula, i.e.
idf(t) = 1 + log(numDocs / (docFreq + 1))
but it does not match my results.
Following is the explanation for one of the results returned.
_explanation: {
  value: 5.8878393
  description: weight(city:chicago in 1) [PerFieldSimilarity], result of:
  details: [
    {
      value: 5.8878393
      description: score(doc=1,freq=1.0 = termFreq=1.0), product of:
      details: [
        {
          value: 0.99999994
          description: queryWeight, product of:
          details: [
            {
              value: 5.88784
              description: idf(docFreq=2, maxDocs=398)
            }
            {
              value: 0.16984157
              description: queryNorm
            }
          ]
        }
        {
          value: 5.88784
          description: fieldWeight in 1, product of:
          details: [
            {
              value: 1
              description: tf(freq=1.0), with freq of:
              details: [
                {
                  value: 1
                  description: termFreq=1.0
                }
              ]
            }
            {
              value: 5.88784
              description: idf(docFreq=2, maxDocs=398)
            }
            {
              value: 1
              description: fieldNorm(doc=1)
            }
          ]
        }
      ]
    }
  ]
}
TF and IDF are calculated per shard, not per index, so the explanation may not show the exact index-wide numbers. Try changing your search type to the distributed one, dfs_query_then_fetch, for more accurate results.
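For illustration, here is a small sketch (the shard sizes and document frequencies below are hypothetical) of why shard-local statistics change the idf that shows up in the explanation:

```python
import math

def idf(doc_freq, num_docs):
    # Lucene's classic TF-IDF idf: 1 + ln(numDocs / (docFreq + 1))
    return 1 + math.log(num_docs / (doc_freq + 1))

# Index-wide statistics (hypothetical): 1000 docs, term in 10 of them.
print(idf(10, 1000))  # idf computed over the whole index

# The same index split over shards: each shard sees only its own
# maxDocs and docFreq, so the default query_then_fetch scores with
# local values, which generally differ from the index-wide idf.
print(idf(2, 200))    # idf as one shard would compute it
```

With dfs_query_then_fetch, Elasticsearch gathers the distributed term statistics first, so every shard scores with the same index-wide frequencies.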
Oops, obvious answer. I see questions about incorrect TF-IDF scores and my mind automatically goes to DFS scoring (which is actually about TF, not IDF).
--
Ivan
On Tue, Feb 11, 2014 at 10:22 AM, Binh Ly <binh@hibalo.com> wrote:
Also be aware that the log should be a natural log, i.e. the base is e
instead of 10. So for example, pulling the first IDF from your results:
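Plugging the numbers from the explanation (docFreq=2, maxDocs=398) into the formula with a natural log does reproduce the reported value; a quick sketch:

```python
import math

# Lucene's classic similarity idf, using the natural log (base e):
# idf(t) = 1 + ln(maxDocs / (docFreq + 1))
idf = 1 + math.log(398 / (2 + 1))
print(idf)  # ~5.88784, the value reported in the explanation
```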
Thanks Binh, your answer solved my problem.
Thanks to you too, Ivan. I have 5 shards, and idf is being calculated from the maxDocs present in each shard. Doesn't that lead to a misleading idf?
On Tuesday, 11 February 2014 20:10:52 UTC+5:30, sunayana choudhary wrote: