What is the scope of TF & IDF calculation?

I am not quite clear on how ES calculates TF/IDF in some situations, such as cross-index/type search, search with filters, etc.

Assume I have two indices, index1 and index2, each of which has two types, type1 and type2. All types in all indices have a field, language, which can be used as a filter.

  1. Cross index search
    GET /index1,index2/type1,type2/_search

    In this case, is IDF calculated based on all docs of all indices (that is, the same IDF is used for index1 and index2), or calculated separately for index1 and index2 (that is, a different IDF for each)?

  2. Search with filter
    GET /index1/type1/_search
    {
    "filter": {
    "term": {
    "language": "english"
    }
    }
    }

    In this situation, is IDF calculated based on all docs in type1 of index1, or just based on docs whose language is "english"?

  1. Scoring is done per shard; the results are then compared across all indices and reduced.
  2. Filters do not score, they are a simple match or no-match.

See https://www.elastic.co/guide/en/elasticsearch/reference/2.3/query-filter-context.html for more

Thanks, walkolm.
For #1, I mean: is the IDF calculated before filtering or after filtering?

You mean with a search in your second point, but over multiple indices?

Sorry for the poor wording.
Let me explain my question with an example.

Assume I create an index: myindex with one type: mytype

And put 3 docs to /myindex/mytype

{
"title": "your search you data",
"language": "english"
}

{
"title": "hello there",
"language": "french"
}

{
"title": "I love elasticsearch",
"language": "english"
}

And search with the following DSL (a match query on title combined with a term filter on language):
GET /myindex/mytype/_search
{
  "query": {
    "bool": {
      "must": {
        "match": { "title": "hello" }
      },
      "filter": {
        "term": { "language": "english" }
      }
    }
  }
}

With the above filtered query, when ES calculates the IDF value, does it count the frequency of the query term in all 3 documents, or only in the 2 documents whose language is "english"?

It will first filter down to the documents with "language": "english", then score any that pass that filter.

The IDF value is per shard, regardless of the type. The "type" is an
Elasticsearch construct, and the Lucene shard knows nothing about it.

And if you noticed, I said it was per shard, not even per index. Since each
index has its own shards, the IDF values are never shared between indices.
But by default, they are not even shared between shards of the same index.
You need to use the dfs_query_then_fetch search type for that to occur.
Small performance hit, but it is worth it in finely tuned search
environments IMHO.

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-type.html
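To make the per-shard point concrete, here is a minimal sketch in Python. It assumes the classic Lucene TF/IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)); the shard names and the statistics are made up for illustration, and the "global" value only approximates what a DFS-style search would compute by summing term statistics across shards before scoring.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """Classic Lucene TF/IDF inverse document frequency (assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical statistics for one term on two shards of the same index:
# shard name -> (documents containing the term, documents in the shard)
shard_stats = {
    "index1[0]": (1, 1000),
    "index1[1]": (20, 1000),
}

# Default behaviour: each shard scores with its own local statistics.
per_shard_idf = {
    shard: classic_idf(df, n) for shard, (df, n) in shard_stats.items()
}

# DFS-style behaviour: term statistics are summed across shards first,
# so every shard scores with the same IDF.
total_df = sum(df for df, _ in shard_stats.values())
total_docs = sum(n for _, n in shard_stats.values())
global_idf = classic_idf(total_df, total_docs)

for shard, idf in per_shard_idf.items():
    print(f"{shard}: local idf = {idf:.3f}")
print(f"all shards: global idf = {global_idf:.3f}")
```

With these made-up numbers the same term gets a noticeably different IDF on each shard, and the aggregated value falls between the two.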

Cheers,

Ivan

And to answer your last question, the IDF of a term is pre-calculated and
not dependent on the documents returned. In other words, it is calculated
pre-filtering.

It still includes deleted/old copies of updated documents as well.
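Applied to the three example documents from the question, the point can be sketched in Python as follows. This assumes a single shard, the classic Lucene formula idf = 1 + ln(numDocs / (docFreq + 1)), and a naive whitespace tokenizer standing in for the real analyzer; it is an illustration of the counting, not of how Lucene is implemented.

```python
import math

docs = [
    {"title": "your search you data", "language": "english"},
    {"title": "hello there", "language": "french"},
    {"title": "I love elasticsearch", "language": "english"},
]


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """Classic Lucene TF/IDF inverse document frequency (assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def doc_freq(term: str, documents: list) -> int:
    """Number of documents whose title contains the term (naive tokenizer)."""
    return sum(1 for d in documents if term in d["title"].lower().split())


term = "hello"

# What the thread describes: statistics come from the whole shard,
# so all 3 documents are counted regardless of the filter.
idf_used = classic_idf(doc_freq(term, docs), len(docs))

# What would happen if statistics were taken after filtering (not the case).
english = [d for d in docs if d["language"] == "english"]
idf_if_post_filter = classic_idf(doc_freq(term, english), len(english))

# The filter still decides which documents can be returned at all.
hits = [d for d in english if term in d["title"].lower().split()]

print(f"idf used (pre-filter stats):   {idf_used:.4f}")
print(f"idf if computed post-filter:   {idf_if_post_filter:.4f}")
print(f"hits after filter: {len(hits)}")
```

In this example the only document containing "hello" is French, so the filtered search returns no hits, yet that document still contributes to the term's document frequency.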

I think that depends on the uniformity of the index. This is the setting. In many cases you are better off just having a single shard for a small index, so that it is automatically "uniform". If the index gets large enough to need multiple shards (a couple of GB), then it is worth playing with the search_type if score is important to you. The default is the default because for lots of people the indexes are pretty uniform and/or deviations in score aren't a huge problem. But the score can deviate quite a bit if you have terms that are genuinely rare.
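For anyone who wants to try this, search_type is passed as a URL parameter on the search request. A small Python sketch that only builds the URL (nothing is sent); the host and the build_search_url helper are hypothetical and just for illustration:

```python
from typing import Optional
from urllib.parse import urlencode


def build_search_url(host: str, index: str,
                     search_type: Optional[str] = None) -> str:
    """Build a _search URL, optionally with an explicit search_type.

    Hypothetical helper: it only formats a string and contacts no cluster.
    """
    url = f"{host}/{index}/_search"
    if search_type:
        url += "?" + urlencode({"search_type": search_type})
    return url


# Default search type (per-shard term statistics).
print(build_search_url("http://localhost:9200", "myindex"))

# Term statistics gathered across shards before scoring.
print(build_search_url("http://localhost:9200", "myindex",
                       "dfs_query_then_fetch"))
```

Running the same query both ways and comparing the _score values is a quick way to see how much the per-shard statistics matter for a given index.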

Uniformity is indeed important. IDF values tend to normalize over large
data sets. The bigger the shard, the better. Rare terms, which is what IDF
was meant to improve, suffer on multi-sharded indices. I emphasize using
single shards for test indices when dealing with relevancy. Many issues with
test cases arise simply because there is not enough data for meaningful
TF/IDF values.

And I think the default is because no one uses Elasticsearch for search
anymore, so why go through the extra search tuning step. :slight_smile:

Ivan

I used it for search, but yeah, lots of use cases aren't search and they'd just be paying the extra query phase price for nothing.

Why do you say "no one uses Elasticsearch for search anymore..."? Does this conclusion come from statistics on user scenarios? And if it is true, does this mean search functionality (including relevance tuning) will have low priority on ES's roadmap?