Computing IDF in Elasticsearch


(sunayana choudhary) #1

Hi all,

I have been analysing Elasticsearch results with explain: true, and I am
not able to understand how idf is calculated. I went through the Lucene
scoring formula, i.e.

idf(t) = 1 + log(numDocs / (docFreq + 1))

but it does not match my results.
The following is the explanation for one of the results returned.
_explanation: {
  value: 5.8878393
  description: weight(city:chicago in 1) [PerFieldSimilarity], result of:
  details: [
    {
      value: 5.8878393
      description: score(doc=1,freq=1.0 = termFreq=1.0 ), product of:
      details: [
        {
          value: 0.99999994
          description: queryWeight, product of:
          details: [
            {
              value: 5.88784
              description: idf(docFreq=2, maxDocs=398)
            }
            {
              value: 0.16984157
              description: queryNorm
            }
          ]
        }
        {
          value: 5.88784
          description: fieldWeight in 1, product of:
          details: [
            {
              value: 1
              description: tf(freq=1.0), with freq of:
              details: [
                {
                  value: 1
                  description: termFreq=1.0
                }
              ]
            }
            {
              value: 5.88784
              description: idf(docFreq=2, maxDocs=398)
            }
            {
              value: 1
              description: fieldNorm(doc=1)
            }
          ]
        }
      ]
    }
  ]
}

}

Thanks in advance.. :slight_smile:

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/c19fe1ea-aba7-4296-9536-cffc25a26836%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Ivan Brusic) #2

Term statistics such as document frequency (and therefore IDF) are
calculated per shard, not per index, so the numbers in the explanation
might not match index-wide values. Try changing your search type to
dfs_query_then_fetch for more accurate results:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-search-type.html#dfs-query-then-fetch
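For illustration, here is a rough sketch of what such a request could look like. It only builds the URL and body and does not send anything. The host `http://localhost:9200`, the index name `my_index`, and the helper `build_search_request` are assumptions made up for this example; `search_type=dfs_query_then_fetch` and `explain` are the options discussed in this thread, and the field and term come from the explanation above.

```python
import json
from urllib.parse import urlencode


def build_search_request(host, index, field, term):
    """Build (url, body) for a term query scored with global term statistics.

    Hypothetical helper for illustration only; nothing is sent over the network.
    """
    # Ask every shard for its term statistics first, then score with the totals.
    params = urlencode({"search_type": "dfs_query_then_fetch"})
    url = f"{host}/{index}/_search?{params}"
    # explain=True makes each hit carry the scoring breakdown.
    body = json.dumps({"explain": True, "query": {"term": {field: term}}})
    return url, body


# Host and index name are assumed values, not taken from the thread.
url, body = build_search_request("http://localhost:9200", "my_index", "city", "chicago")
print(url)
print(body)
```

The body can then be sent as a search request with any HTTP client.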

How many shards does your index have?

--
Ivan



(Binh Ly) #3

Also be aware that the log should be a natural log, i.e. the base is e
instead of 10. So for example, pulling the first IDF from your results:

value: 5.88784
description: idf(docFreq=2, maxDocs=398)
idf = 1 + ln(398 / (2 + 1)) = 5.8878397166163280134321081764042
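The same check can be done in a few lines of Python. This is only a sketch of the arithmetic above, not Lucene's actual code; the function name `classic_idf` is made up for the example. The queryNorm line assumes a single-term query with boost 1, where the norm would be 1 / sqrt(idf^2).

```python
import math


def classic_idf(doc_freq, max_docs):
    """1 + ln(maxDocs / (docFreq + 1)), using the natural logarithm."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


idf = classic_idf(2, 398)

# The same formula with a base-10 log, which gives a much smaller value
# and does not match the explain output.
idf_base10 = 1.0 + math.log10(398 / (2 + 1))

# Assumption: single-term query, boost 1, so queryNorm = 1 / sqrt(idf^2).
query_norm = 1.0 / math.sqrt(idf ** 2)

print(idf)         # compare with idf(docFreq=2, maxDocs=398) in the explanation
print(idf_base10)
print(query_norm)  # compare with queryNorm in the explanation
```

Lucene works in 32-bit floats, so the last digits in the explain output can differ slightly from what Python prints.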



(Ivan Brusic) #4

Oops, obvious answer. :slight_smile: I see questions about incorrect TF-IDF scores and
my mind automatically goes to DFS scoring (which is about term statistics
gathered across shards, not the IDF formula).

--
Ivan



(sunayana choudhary) #5

Thanks Binh, your answer solved my problem.
Thanks to you too, Ivan. I have 5 shards, and idf is being calculated from
the maxDocs present in each shard. Doesn't that lead to a misleading idf?
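To make the question concrete, here is a small sketch comparing per-shard idf with the idf computed from index-wide totals. The (docFreq, maxDocs) pairs are invented for illustration, apart from the first one, which is the shard from the explanation above. `classic_idf` is a made-up name for the formula discussed in this thread.

```python
import math


def classic_idf(doc_freq, max_docs):
    """1 + ln(maxDocs / (docFreq + 1)), using the natural logarithm."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# Hypothetical (doc_freq, max_docs) for the same term on five shards.
# Only the first pair comes from the explain output; the rest are made up.
shards = [(2, 398), (0, 401), (5, 395), (1, 402), (3, 404)]

# What each shard would compute on its own with the default search type.
per_shard = [classic_idf(df, n) for df, n in shards]

# What the totals would give if the statistics were combined across shards.
total_df = sum(df for df, _ in shards)
total_docs = sum(n for _, n in shards)
global_idf = classic_idf(total_df, total_docs)

for (df, n), value in zip(shards, per_shard):
    print(f"docFreq={df} maxDocs={n} idf={value:.5f}")
print(f"global docFreq={total_df} maxDocs={total_docs} idf={global_idf:.5f}")
```

With these invented numbers the per-shard values spread around the global one, so documents from different shards are scored on slightly different scales. How much that matters depends on how evenly the term is spread over the shards.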


