I am not quite clear how ES calculates TF/IDF in some situations, such as cross index/type search, search with filters, etc.
Assume I have two indices, index1 and index2, each of which has two types, type1 and type2. All types in all indices have a field, language, which can be used as a filter.
Cross index search
GET /index1,index2/type1,type2/_search
In this case, is IDF calculated based on all docs of all indices (that is, the same IDF is used for index1 and index2), or calculated separately for index1 and index2 (that is, a different IDF for each index)?
Search with filter
GET /index1/type1/_search
{
  "filter": {
    "term": {
      "language": "english"
    }
  }
}
In this situation, is IDF calculated based on all docs in type1 of index1, or just based on docs whose language is "english"?
Sorry for the poor wording.
Let me explain my question with an example.
Assume I create an index, myindex, with one type, mytype,
and put 3 docs into /myindex/mytype:
{
  "title": "your search you data",
  "language": "english"
}
{
  "title": "hello there",
  "language": "french"
}
{
  "title": "I love elasticsearch",
  "language": "english"
}
And search with the following DSL (please ignore any incorrect syntax):
GET /myindex/mytype/_search
{
  "query": {
    "match": {
      "title": "hello"
    }
  },
  "filter": {
    "term": {
      "language": "english"
    }
  }
}
With the above filtered query, when ES calculates the IDF value, does it count the frequency of the query term in all 3 documents, or only in the 2 documents whose language is "english"?
The IDF value is per shard, regardless of the type. The "type" is an
Elasticsearch construct, and the Lucene shard knows nothing about it.
And if you noticed, I said it was per shard, not even per index. Since each
index has its own shards, the IDF values are never shared between indices.
But by default, they are not even shared between the shards of the same
index. You need to enable distributed term frequencies (the
dfs_query_then_fetch search type) for that to occur. Small performance hit,
but it is worth it in finely tuned search environments IMHO.
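To make the per-shard point concrete, here is a rough Python sketch. It assumes Lucene's classic TF/IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)); the two shards and their counts are made-up numbers for illustration, not taken from a real index.

```python
import math

def classic_idf(doc_freq, num_docs):
    """Classic Lucene TF/IDF: idf = 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical shards of one index:
# (number of docs containing the term, total docs in the shard).
shards = {"shard0": (1, 1000), "shard1": (40, 1000)}

# Default behaviour: each shard scores with its own local statistics.
local_idf = {name: classic_idf(df, n) for name, (df, n) in shards.items()}

# DFS-style behaviour: statistics are summed over all shards first,
# so every shard scores the term with the same IDF.
total_df = sum(df for df, _ in shards.values())
total_docs = sum(n for _, n in shards.values())
global_idf = classic_idf(total_df, total_docs)

for name, value in local_idf.items():
    print(f"{name}: local idf = {value:.3f}")
print(f"all shards combined: idf = {global_idf:.3f}")
```

The same term gets a noticeably higher IDF on the shard where it happens to be rare, so an otherwise identical document can score differently depending on which shard it landed on.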
And to answer your last question, the IDF of a term is computed from the
shard's term statistics and does not depend on the documents returned. In
other words, it is calculated pre-filtering.
The counts also still include deleted documents and old copies of updated
documents, until those are merged away.
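Applied to the three example documents from the question, a small Python sketch shows the difference. It assumes the classic Lucene formula idf = 1 + ln(numDocs / (docFreq + 1)), a single shard, and a naive lowercase/whitespace tokenizer standing in for the real analyzer.

```python
import math

# The three documents from the question, all on one (assumed) shard.
docs = [
    {"title": "your search you data", "language": "english"},
    {"title": "hello there", "language": "french"},
    {"title": "I love elasticsearch", "language": "english"},
]

def classic_idf(doc_freq, num_docs):
    """Classic Lucene TF/IDF: idf = 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def doc_freq(term, documents):
    """Count documents whose title contains the term (naive tokenizer)."""
    return sum(1 for d in documents if term in d["title"].lower().split())

term = "hello"

# What scoring uses: statistics over every document in the shard.
idf_used = classic_idf(doc_freq(term, docs), len(docs))

# What it would be if the filter were applied first (this is NOT what happens).
english = [d for d in docs if d["language"] == "english"]
idf_if_filtered_first = classic_idf(doc_freq(term, english), len(english))

print(f"idf over all 3 docs:        {idf_used:.3f}")
print(f"idf over english docs only: {idf_if_filtered_first:.3f}")
```

So the answer to the question is "all 3 documents": the filter only decides which documents may be returned, not which ones feed the term statistics.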
I think that depends on the uniformity of the index. The relevant setting is search_type. In many cases you are better off with a single shard for a small index, so that it is automatically "uniform". If the index gets large enough to need multiple shards (a couple of GB), then it is worth playing with the search_type if score is important to you. The default is the default because for lots of people the indexes are pretty uniform and/or deviations in score aren't a huge problem. But the score can deviate quite a bit if you have terms that are genuinely rare.
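For anyone who wants to try it, the search type is passed as a URL parameter on the search request. Below is a minimal Python sketch that only builds the request and does not send it; the host http://localhost:9200 and the helper name build_search_request are assumptions for illustration, while search_type=dfs_query_then_fetch is the actual parameter value.

```python
import json
from urllib.parse import urlencode

def build_search_request(index, body, search_type=None,
                         host="http://localhost:9200"):
    """Return (url, json_payload) for a search; nothing is sent."""
    url = f"{host}/{index}/_search"
    if search_type:
        # e.g. dfs_query_then_fetch collects term statistics from all
        # shards before scoring, instead of using per-shard statistics.
        url += "?" + urlencode({"search_type": search_type})
    return url, json.dumps(body)

url, payload = build_search_request(
    "myindex",
    {"query": {"match": {"title": "hello"}}},
    search_type="dfs_query_then_fetch",
)
print(url)
print(payload)
```

Running the same query with and without the parameter and comparing the scores is a quick way to see whether your shards are uniform enough for the default.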
Uniformity is indeed important. IDF values tend to normalize over large
data sets. The bigger the shard, the better. Rare terms, which are what IDF
was meant to help with, suffer on multi-sharded indices. I recommend using
a single shard for test indices when dealing with relevancy. Many issues
with test cases arise simply because there is not enough data for
meaningful TF/IDF values.
And I think the default is what it is because no one uses Elasticsearch for
search anymore, so why go through the extra search tuning step?
Why do you say "no one uses Elasticsearch for search anymore..."? Does this conclusion come from statistics on how people use it? And if it is true, does it mean search functionality (including relevance tuning) will have low priority on ES's roadmap?