Compute TF/IDF across indexes

luizgpsantos · February 25, 2014, 8:00pm

Hi,

I'm trying to search across multiple indexes and I don't understand the
result of the TF/IDF scoring. I didn't expect documents in the index where
the term is more frequent to be penalized.

Here follows an example:

https://gist.github.com/luizgpsantos/9216108 (gistfile1.sh)

curl -XPUT localhost:9200/index1/type/1 -d '{
  "title": "alice dance"
}'

curl -XPUT localhost:9200/index1/type/2 -d '{
  "title": "alice jump"
}'

curl -XPUT localhost:9200/index1/type/3 -d '{
  "title": "alice run"
}'

(The embedded gist is truncated here; the rest of the file is at the link
above.)

When searching for the term "alice", the document {"_index": "index2",
"_type": "type", "_id": "1"} got a score of 0.8784157, while {"_index":
"index1", "_type": "type", "_id": "1"} got a score of 0.4451987.

In my use case I have one index about sports and another about celebrities.
When I search for a celebrity across both indexes, results from the sports
index tend to rank first because of the behavior described above (the sports
index has few documents about celebrities). But when searching for a
celebrity I would expect results from the celebrities index.

Is there any way to calculate the score without penalizing indexes where
the term is more frequent?

Cheers,

--
Luiz Guilherme P. Santos

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAMdL%3DZGe4ywgNX0JaBjQQ0HAc9_CQ-iz0trZ7vbqT4CVvizmpQ%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.

Ivan · February 25, 2014, 8:15pm

I have never tried this or looked at the code, but off the top of my head
perhaps the DFS query type (search_type=dfs_query_then_fetch) would work.

Since the DFS query type first collects the term statistics from each
individual shard and combines them before scoring, perhaps it ignores which
index a shard belongs to. Easy to test.
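
A minimal sketch of that test, assuming the index1 and index2 indices from
the gist and a node on localhost:9200. The helper names search_url and
search_alice are hypothetical, and the Content-Type header is only needed on
newer Elasticsearch versions.

```shell
# Hypothetical helper: build the multi-index search URL for a search type.
search_url() {
  echo "localhost:9200/index1,index2/_search?search_type=$1"
}

# Hypothetical helper: search both indices for "alice" in the title field.
search_alice() {
  curl -s -XGET "$(search_url "$1")" \
    -H 'Content-Type: application/json' \
    -d '{"query": {"match": {"title": "alice"}}}'
}

# Usage (needs a running node), then compare the _score values:
#   search_alice query_then_fetch       # default: per-shard statistics
#   search_alice dfs_query_then_fetch   # statistics gathered up front
```

If the scores of comparable documents in the two indices come out equal with
dfs_query_then_fetch, the statistics are being combined across indices.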

If not, the solution might be tricky. You can eliminate field length
normalization, but your issue is with the IDF. You can create your own
Similarity, but the best you can do is ignore the IDF, which probably would
not be ideal.
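
To see why the IDF is the culprit, here is a rough sketch using the classic
Lucene formula idf = 1 + ln(numDocs / (docFreq + 1)). The document counts
are made up for illustration and the idf helper is hypothetical; real scores
also include TF, norms and query normalization.

```shell
# Hypothetical helper: classic Lucene IDF.
# $1 = number of documents, $2 = number of documents containing the term.
idf() {
  awk -v n="$1" -v df="$2" 'BEGIN { printf "%.4f\n", 1 + log(n / (df + 1)) }'
}

# Made-up statistics: the term is common in one index and rare in the other.
idf_celebrities=$(idf 1000 400)
idf_sports=$(idf 1000 4)

echo "celebrities idf: $idf_celebrities"
echo "sports idf:      $idf_sports"
```

With statistics computed separately, the index where the term is rare gets
the larger IDF, so its documents score higher for the same query.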

Ultimately, you can try script-based scoring. The TF/IDF values are exposed
to scripts, so you could apply some type of normalization yourself. It is
kludgy and it would impact performance.

Hopefully DFS queries would work or someone else has a better idea!

Cheers,

Ivan


luizgpsantos · February 26, 2014, 2:04am

Hi Ivan,

The dfs_query_then_fetch search type worked very well!

Thank you!

Cheers,
Luiz Guilherme


Ivan · February 26, 2014, 6:46am

Great, I am glad that it worked. I do not use multi-index searches, so I
was not sure if it would. Good to know that term statistics from shards of
different indices can be aggregated with DFS queries.
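
As a rough illustration of what that aggregation means for the IDF, the
sketch below sums made-up per-index statistics before applying the classic
Lucene formula idf = 1 + ln(numDocs / (docFreq + 1)), so both indices end up
with the same IDF for the term. The counts and variable names are
hypothetical; this mimics the idea, not Elasticsearch's actual code.

```shell
# Made-up statistics for one term in two indices.
celebrities_docs=1000; celebrities_df=400
sports_docs=1000;      sports_df=4

# DFS idea: add up the statistics first, then compute a single IDF.
total_docs=$((celebrities_docs + sports_docs))
total_df=$((celebrities_df + sports_df))

global_idf=$(awk -v n="$total_docs" -v df="$total_df" \
  'BEGIN { printf "%.4f\n", 1 + log(n / (df + 1)) }')

echo "global idf used for both indices: $global_idf"
```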

--
Ivan


Binh_Ly_2 · February 26, 2014, 1:53pm

I tried this and indeed it works, so thanks Ivan for the tip!

