What is the scope of TF & IDF calculation?

Youxu · May 18, 2016, 2:41am

I am not quite clear on how ES calculates TF/IDF in some situations, such as cross-index/type search, search with filters, etc.

Assume I have two indices, index1 and index2, each of which has two types, type1 and type2. All types in all indices have a field, language, which can be used as a filter.

Cross index search
GET /index1,index2/type1,type2/_search

In this case, is IDF calculated over all docs of all indices (that is, the same IDF is used for index1 and index2), or calculated separately for index1 and index2 (that is, a different IDF for each)?

Search with filter
GET /index1/type1/_search
{
  "filter": {
    "term": {
      "language": "english"
    }
  }
}

In this situation, is IDF calculated based on all docs in type1 of index1, or just based on docs whose language is "english"?

warkolm · May 18, 2016, 5:29am

Scoring is done per shard; the results are then compared across all indices and reduced.
Filters do not score; they are a simple match or no-match.

See https://www.elastic.co/guide/en/elasticsearch/reference/2.3/query-filter-context.html for more

Youxu · May 18, 2016, 9:56am

Thanks, warkolm.
For #1, I mean: is the IDF calculated before filtering or after filtering?

warkolm · May 18, 2016, 10:17am

You mean with a search in your second point, but over multiple indices?

Youxu · May 19, 2016, 4:32pm

Sorry for the poor wording.
Let me explain my question with an example.

Assume I create an index, myindex, with one type, mytype,

and put 3 docs into /myindex/mytype:

{
"title": "your search you data",
"language": "english"
}

{
"title": "hello there",
"language": "french"
}

{
"title": "I love elasticsearch",
"language": "english"
}

And search with the following DSL:
GET /myindex/mytype/_search
{
  "query": {
    "bool": {
      "must": { "match": { "title": "hello" } },
      "filter": { "term": { "language": "english" } }
    }
  }
}

With the above filtered query, when ES calculates the IDF value, does it count the frequency of the query term across all 3 documents, or only in the 2 documents whose language is "english"?

warkolm · May 23, 2016, 3:51am

It will first keep only the documents with "language": "english", then score any that pass that filter.

Ivan · May 23, 2016, 4:21pm

The IDF value is per shard, regardless of the type. The "type" is an
Elasticsearch construct, and the Lucene shard knows nothing about it.

And if you noticed, I said it was per shard, not even per index. Since each
index has its own shards, the IDF values are never shared between indices.
But by default, they are not even shared between shards of the same index.
You need to enable distributed term frequencies (the dfs_query_then_fetch
search type) for that to occur. Small performance hit, but it is worth it
in finely tuned search environments IMHO.

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-type.html

Cheers,

Ivan
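
To make the per-shard point concrete, here is a minimal sketch in Python, not Elasticsearch code. It assumes the classic Lucene TF/IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)), and the shard names and statistics are invented for illustration. It compares the IDF each shard would compute on its own with the IDF computed from statistics summed over both shards, which is roughly what the dfs_query_then_fetch search type arranges.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Classic Lucene TF/IDF inverse document frequency (assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical statistics for one rare term: shard -> (doc_freq, num_docs).
shard_stats = {
    "shard0": (1, 1000),
    "shard1": (10, 1000),
}

# Default behaviour: every shard scores with its own local statistics.
per_shard_idf = {
    name: classic_idf(df, n) for name, (df, n) in shard_stats.items()
}

# With distributed term frequencies, statistics are summed across shards
# before scoring, so every shard uses the same IDF.
total_df = sum(df for df, _ in shard_stats.values())
total_docs = sum(n for _, n in shard_stats.values())
global_idf = classic_idf(total_df, total_docs)

for name, value in per_shard_idf.items():
    print(f"{name}: local idf = {value:.4f}")
print(f"global idf = {global_idf:.4f}")
```

In this sketch the same term gets a noticeably higher IDF on the shard where it is rarer, so equally good matches from different shards would be ranked differently. On a real cluster the search type is chosen per request, for example GET /myindex/_search?search_type=dfs_query_then_fetch.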

Ivan · May 23, 2016, 4:24pm

And to answer your last question, the IDF of a term is pre-calculated and
not dependent on the documents returned. In other words, it is calculated
pre-filtering.

nik9000 · May 23, 2016, 5:42pm

It still includes deleted/old copies of updated documents as well.

I think that depends on the uniformity of the index. This is the setting. In many cases you are better off just having a single shard for a small index so it is automatically "uniform". If the index gets large enough to need multiple shards (a couple of GB), then it is worth playing with the search_type if score is important to you. The default is the default because for lots of people the indexes are pretty uniform and/or deviations in score aren't a huge problem. But the score can deviate quite a bit if you have terms that are genuinely rare.
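
The first remark above, that old copies of updated documents still count, can be illustrated with a toy model. The ToySegment class below is entirely hypothetical and is not a Lucene or Elasticsearch API. It only assumes that an update is a delete plus a re-add, and that deleted documents keep contributing to the term statistics until a merge drops them.

```python
class ToySegment:
    """Hypothetical stand-in for an index segment; not a real Lucene API."""

    def __init__(self):
        self.docs = []        # token lists for every doc ever written
        self.deleted = set()  # ids marked as deleted but still stored

    def add(self, text):
        self.docs.append(text.lower().split())
        return len(self.docs) - 1

    def update(self, doc_id, text):
        # An update marks the old copy as deleted and writes a new copy.
        self.deleted.add(doc_id)
        return self.add(text)

    def max_doc(self):
        # Counts deleted copies too.
        return len(self.docs)

    def doc_freq(self, term):
        # Deleted copies still contribute until they are merged away.
        return sum(1 for tokens in self.docs if term in tokens)

    def merge(self):
        # A merge physically drops the deleted copies.
        self.docs = [
            tokens for i, tokens in enumerate(self.docs)
            if i not in self.deleted
        ]
        self.deleted = set()


segment = ToySegment()
first = segment.add("hello there")
segment.add("I love elasticsearch")
segment.update(first, "hello world")

before = (segment.doc_freq("hello"), segment.max_doc())
segment.merge()
after = (segment.doc_freq("hello"), segment.max_doc())
print("before merge (doc_freq, max_doc):", before)
print("after merge  (doc_freq, max_doc):", after)
```

In this model the term "hello" is counted twice until the merge, even though only one live document contains it, which is one more reason scores on a small or frequently updated index can look surprising.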

Ivan · May 23, 2016, 6:56pm

Uniformity is indeed important. IDF values tend to normalize over large
data sets. The bigger the shard, the better. Rare terms, which is what IDF
was meant to improve, suffer on multi-sharded indices. I emphasize using
a single shard for test indices when dealing with relevancy. Many issues
with test cases arise simply because there is not enough data for relevant
TF/IDF values.

And I think the default is because no one uses Elasticsearch for search
anymore, so why go through the extra search tuning step.

Ivan

nik9000 · May 23, 2016, 7:40pm

I used it for search, but yeah, lots of use cases aren't search and they'd just be paying the extra query phase price for nothing.

Youxu · May 24, 2016, 2:01pm

Why do you say "no one uses Elasticsearch for search anymore..."? Does this conclusion come from usage statistics? And if it is true, does this mean search functionality (including relevance tuning) will have low priority on ES's roadmap?
