Calculating with Document Frequency, not Inverse Document Frequency

Tomas_Petulik · June 19, 2013, 11:19am

Hello,

I have a gist at https://gist.github.com/anonymous/5813541 where I search for
'small'. But the result where 'small' occurs just once scores higher than the
result where 'small' occurs twice.

I want the results ranked so that the more often 'small' occurs, the higher
the document appears. So, I want DF, not IDF.

Is there any way to disable IDF or override idf() in the Similarity class? Or
how else can I solve the problem?

Thanks in advance.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Ivan · June 20, 2013, 2:53pm

Your gist is not loading for me right now (github down?), so I will try to
answer without all the information.

In TF-IDF, the term frequency is used to weight which document is more
relevant, but the IDF is used to weigh which term in the query is more
relevant. For a single term query, IDF is irrelevant since it will be the
same for every document. You want to focus on the TF side of the algorithm.
First, you should have term frequencies enabled on the field. Second, the
field length is being normalized, so you might get some benefit from
omitting the norms.
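
To illustrate the point about term frequency and field-length norms, here is a minimal Python sketch of a single-term score. It assumes the classic Lucene DefaultSimilarity factors (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), norm = 1 / sqrt(field length)) and ignores queryNorm and the lossy norm encoding. The two documents and all counts are made up for illustration and are not taken from the gist.

```python
import math

def classic_tf(freq: int) -> float:
    # Term-frequency factor: grows with the square root of the raw count.
    return math.sqrt(freq)

def classic_idf(doc_freq: int, num_docs: int) -> float:
    # Same value for every document when the query has a single term.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def field_norm(num_terms: int) -> float:
    # Length normalization: longer fields are penalized.
    return 1.0 / math.sqrt(num_terms)

def term_score(freq: int, num_terms: int, doc_freq: int, num_docs: int,
               use_norms: bool = True) -> float:
    norm = field_norm(num_terms) if use_norms else 1.0
    return classic_tf(freq) * classic_idf(doc_freq, num_docs) ** 2 * norm

# Hypothetical documents: name -> (occurrences of 'small', field length in terms)
DOCS = {"once_short": (1, 2), "twice_long": (2, 12)}

def rank(use_norms: bool) -> list:
    # Assumed index statistics: 10 documents, 2 of which contain the term.
    scores = {
        name: term_score(freq, length, doc_freq=2, num_docs=10, use_norms=use_norms)
        for name, (freq, length) in DOCS.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

if __name__ == "__main__":
    print("with norms:   ", rank(use_norms=True))
    print("without norms:", rank(use_norms=False))
```

In this toy setup the shorter document wins while norms are applied, and the document with two occurrences wins once norms are omitted. The IDF factor is identical for both documents either way.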

Cheers,

Ivan

Ivan · June 20, 2013, 4:37pm

Finally got your gist to load and noticed that you are using BM25. Never
used BM25 with elasticsearch/Lucene but my point about field length
normalization still remains. You can try changing the 'b' parameter. The
default value is 0.75. Posting your explain output will help.
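
As a rough illustration of what the 'b' parameter does, here is a small Python sketch of the BM25 term score. It assumes the usual Lucene BM25 form (k1 = 1.2 and b = 0.75 as defaults) and uses made-up document lengths and frequencies, so treat it as a sketch rather than the exact scorer.

```python
import math

def bm25_idf(doc_freq: int, num_docs: int) -> float:
    # Identical for every document matching a single-term query.
    return math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_term_score(freq: int, doc_len: int, avg_len: float,
                    doc_freq: int, num_docs: int,
                    k1: float = 1.2, b: float = 0.75) -> float:
    # b controls how strongly the field length is normalized:
    # b = 0 disables length normalization, b = 1 applies it fully.
    length_factor = 1.0 - b + b * (doc_len / avg_len)
    tf_part = freq * (k1 + 1.0) / (freq + k1 * length_factor)
    return bm25_idf(doc_freq, num_docs) * tf_part

# Hypothetical documents: name -> (occurrences of 'small', field length in terms)
DOCS = {"once_short": (1, 2), "twice_long": (2, 12)}

def rank(b: float) -> list:
    avg_len = sum(length for _, length in DOCS.values()) / len(DOCS)
    scores = {
        name: bm25_term_score(freq, length, avg_len, doc_freq=2, num_docs=10, b=b)
        for name, (freq, length) in DOCS.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

if __name__ == "__main__":
    print("b=0.75:", rank(0.75))
    print("b=0.0: ", rank(0.0))
```

With these assumed numbers, the short document with one occurrence ranks first at b = 0.75, and the longer document with two occurrences ranks first at b = 0.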

I finally upgraded to 0.00, so I hope to play around with similarities soon.

Cheers,

Ivan

Tomas_Petulik · June 21, 2013, 7:57am

On Thursday, June 20, 2013 4:53:11 PM UTC+2, Ivan Brusic wrote:

In TF-IDF, the term frequency is used to weight which document is more
relevant, but the IDF is used to weigh which term in the query is more
relevant. For a single term query, IDF is irrelevant since it will be the
same for every document

Thanks for the reply. To simplify, here is a gist using the default TF-IDF,
https://gist.github.com/tomaspet262/5829590, with the results (explain=true)
at the bottom.

More specifically, look at the part of the explain output for id=5 and id=7
(for the match in section). The excerpt is here:
https://picasaweb.google.com/tomas.pet262/21June2013#5891840719317352962 .
id=7 scores higher because its IDF is 1 (in id=7, 'small' occurs only once).
id=5 scores lower because its IDF is 0.306.

This is the opposite of what I want. I want id=5 to score higher, not the
other way around. How can I achieve that?

Tomas_Petulik · June 21, 2013, 8:04am

Do not know why gist is not working for me either :'(. Mirror is at
http://pastebin.com/1HVAYSn0

Ivan · June 21, 2013, 5:50pm

Before trying to debug and fine-tune further, there are some other things
to look into.

First of all, TF-IDF breaks down somewhat with small data sets. Scores begin
to stabilize as the index grows. The problem is exacerbated in elasticsearch
because TF-IDF is computed per shard and then the non-normalized results
are aggregated from each shard. Try one of two things:

1. Change your search type to dfs_query_then_fetch, which calculates TF-IDF
   values across all shards. I found the performance penalty to be minimal.
2. Change the number of shards to 1 for small indices.

Perhaps you already have done one of the above, but just wanted to double
check before digging deeper.
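
To show why per-shard statistics can flip the order, here is a small Python simulation. It assumes the classic Lucene idf formula, 1 + ln(numDocs / (docFreq + 1)), ignores norms and queryNorm, and uses hypothetical shard contents chosen so that the per-shard IDF values come out near the 1 and 0.306 quoted earlier in the thread. The real shard layout of the gist is not known.

```python
import math
from dataclasses import dataclass

@dataclass
class Shard:
    num_docs: int      # documents stored on this shard
    doc_freq: int      # documents on this shard containing the term
    term_freqs: dict   # doc id -> occurrences of the term in that doc

def classic_idf(doc_freq: int, num_docs: int) -> float:
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def score(freq: int, idf: float) -> float:
    # sqrt(tf) * idf^2, leaving out norms and queryNorm for brevity
    return math.sqrt(freq) * idf ** 2

def query_then_fetch(shards: list) -> dict:
    # Each shard scores its hits with its own local statistics.
    scores = {}
    for shard in shards:
        idf = classic_idf(shard.doc_freq, shard.num_docs)
        for doc_id, freq in shard.term_freqs.items():
            scores[doc_id] = score(freq, idf)
    return scores

def dfs_query_then_fetch(shards: list) -> dict:
    # Term statistics are summed over all shards before scoring.
    total_docs = sum(shard.num_docs for shard in shards)
    total_df = sum(shard.doc_freq for shard in shards)
    idf = classic_idf(total_df, total_docs)
    return {
        doc_id: score(freq, idf)
        for shard in shards
        for doc_id, freq in shard.term_freqs.items()
    }

# Hypothetical layout: doc "7" (one occurrence) shares a shard with a
# non-matching doc; doc "5" (two occurrences) sits alone on its shard.
SHARDS = [
    Shard(num_docs=2, doc_freq=1, term_freqs={"7": 1}),
    Shard(num_docs=1, doc_freq=1, term_freqs={"5": 2}),
]

if __name__ == "__main__":
    print("per-shard stats:", query_then_fetch(SHARDS))
    print("global stats:   ", dfs_query_then_fetch(SHARDS))
```

Under these assumptions, doc "7" outranks doc "5" when each shard uses its own statistics, and doc "5" moves to the top once the statistics are combined. Putting everything on a single shard has the same effect, since local and global statistics then coincide.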

--
Ivan

Tomas_Petulik · June 25, 2013, 9:06am

On Friday, June 21, 2013 7:50:11 PM UTC+2, Ivan Brusic wrote:

Change your search type to dfs_query_then_fetch, which calculates TF-IDF
values across all shards. I found the performance penalty to be minimal.

I changed the search type, and the results are now ordered exactly as I
want. Problem solved :-). Thank you so much!
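
For anyone landing here later, a minimal Python sketch of what such a request might look like follows. The host and the index name test come from the pastebin script, while the field name section and the helper build_search_request are assumptions made for illustration. The request is only constructed here, not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

def build_search_request(host: str, index: str, field: str, text: str) -> Request:
    # search_type is passed as a URL parameter on the _search endpoint.
    params = urlencode({"search_type": "dfs_query_then_fetch"})
    url = f"{host}/{index}/_search?{params}"
    # explain=true asks for the per-document score breakdown.
    body = {"query": {"match": {field: text}}, "explain": True}
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    # Assumed values: local node, index 'test', field 'section'.
    req = build_search_request("http://localhost:9200", "test", "section", "small")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually run it against a live node:
    #   from urllib.request import urlopen
    #   print(urlopen(req).read().decode("utf-8"))
```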
