Different IDF for different documents

onami · June 28, 2018, 3:25am

Hello,
I'm trying to figure out why my documents are ranked the way they are. Here are two samples that puzzle me.

The top result has docFreq = 45, docCount = 476 and weight(Synonym(descRu:лн descRu:ол descRu:олн) in 313).
The second one has docFreq = 64, docCount = 527 and weight(Synonym(descRu:лн descRu:ол descRu:олн) in 324).

This doesn't make sense to me: I expected the IDF to be the same, since it's the same term.

I hope you can explain what's going on.

polyfractal · June 29, 2018, 1:23pm

It's likely that they came from different shards. Term frequencies and IDFs are computed on a shard-local basis, which allows the search to happen in a coordination-free environment (the shards don't have to talk to each other and can execute in parallel).
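To make that concrete, here is a rough sketch that plugs the two sets of numbers from the question into the BM25 IDF formula. It assumes the default BM25 similarity (the default in 6.x) and the usual form log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)); the function name `bm25_idf` is only illustrative, and a custom similarity on the field would give different values.

```python
import math


def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """IDF as computed by BM25 (assumed default similarity; illustrative helper)."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Statistics copied from the two explain outputs in the question.
idf_first = bm25_idf(doc_freq=45, doc_count=476)
idf_second = bm25_idf(doc_freq=64, doc_count=527)

print(f"first hit:  idf = {idf_first:.4f}")
print(f"second hit: idf = {idf_second:.4f}")
```

Same term, but each shard sees a different docFreq and docCount, so each shard arrives at a different IDF.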

Generally, this works fine because there is "enough data" to smooth out the discrepancies in TF/IDF, and scoring ends up being similar. But with few documents, or documents that aren't randomly distributed (e.g. using custom routing) you can run into more severe differences.

If you use DFS mode for search (https://www.elastic.co/guide/en/elasticsearch/reference/6.2/search-request-search-type.html#dfs-query-then-fetch), Elasticsearch runs a pre-search phase that collects term and document frequencies from all the shards, compiles a "global" set of statistics, and uses that on each shard for scoring.
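As a rough illustration of what "global" statistics means, the sketch below sums the per-shard numbers from the question and recomputes the IDF once. It assumes, purely for the example, that the index has exactly two shards and that the two hits came from different ones; a real index may have more shards, and the BM25 formula is again the assumed default similarity.

```python
import math

# Per-shard statistics from the question (assumption: only these two shards exist).
shard_stats = [
    {"doc_freq": 45, "doc_count": 476},
    {"doc_freq": 64, "doc_count": 527},
]

# The DFS phase conceptually adds up the statistics across shards.
global_doc_freq = sum(s["doc_freq"] for s in shard_stats)
global_doc_count = sum(s["doc_count"] for s in shard_stats)

# Every shard then scores with the same, index-wide IDF.
global_idf = math.log(
    1 + (global_doc_count - global_doc_freq + 0.5) / (global_doc_freq + 0.5)
)

print(global_doc_freq, global_doc_count, f"{global_idf:.4f}")
```

With shared statistics, both documents would get the same IDF for the term, so any remaining score difference comes from term frequency and field length.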

Scoring will be more "accurate" at the cost of an extra round trip and more work. Generally it's not needed though.
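If you want to try it, the search type is passed as a query-string parameter on the search request. The sketch below only builds the request and does not send it. The host `http://localhost:9200`, the index name `my_index`, and the query text are placeholders I made up; `descRu` is the field from your explain output.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumption: a local node
INDEX = "my_index"                  # hypothetical index name

# search_type selects the DFS pre-search phase described above.
params = urlencode({"search_type": "dfs_query_then_fetch"})

body = {
    "explain": True,  # include docFreq/docCount details in each hit
    "query": {"match": {"descRu": "олн"}},  # placeholder query text
}

request = Request(
    url=f"{BASE_URL}/{INDEX}/_search?{params}",
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# urllib.request.urlopen(request) would send it to a running cluster.
print(request.full_url)
```

Comparing the explain output with and without the parameter should show whether shard-local statistics are what's behind the ranking.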
