Calculating with Document Frequency, not Inverse Document Frequency

Hello,

I have a gist, https://gist.github.com/anonymous/5813541, where I search for
'small'. But the result where 'small' occurs just once scores higher than the
result where 'small' occurs twice.

I want the results ranked so that the more often 'small' occurs, the higher
the result appears. So, I want DF, not IDF.

Is there any way to disable IDF or override idf() in the Similarity class? Or
how else can I solve the problem?

Thanks in advance.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Your gist is not loading for me right now (GitHub down?), so I will try to
answer without all the information.

In TF-IDF, the term frequency is used to weight which document is more
relevant, but the IDF is used to weigh which term in the query is more
relevant. For a single term query, IDF is irrelevant since it will be the
same for every document. You want to focus on the TF side of the algorithm.
First, you should have term frequencies enabled on the field. Second, the
field length is being normalized, so you might get some benefit from
omitting the norms.
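
To illustrate why omitting norms can matter here, this is a minimal sketch of
Lucene's classic TF-IDF scoring for a single-term query. It assumes
tf = sqrt(freq) and a length norm of 1/sqrt(number of terms), and it ignores
the lossy one-byte encoding Lucene applies to norms. The two documents and
their lengths are made up for illustration, and classic_score is a
hypothetical helper, not a Lucene or elasticsearch API.

```python
import math

def classic_score(freq, field_length, idf=1.0, use_norms=True):
    """Simplified single-term score: sqrt(freq) * idf * length norm."""
    tf = math.sqrt(freq)
    norm = 1.0 / math.sqrt(field_length) if use_norms else 1.0
    return tf * idf * norm

# Hypothetical documents: (occurrences of 'small', total terms in the field)
short_doc = (1, 2)
long_doc = (2, 16)

for use_norms in (True, False):
    short_score = classic_score(*short_doc, use_norms=use_norms)
    long_score = classic_score(*long_doc, use_norms=use_norms)
    print(f"norms={use_norms}: short={short_score:.3f} long={long_score:.3f}")
```

With norms, the short document with one occurrence outranks the longer
document with two. Without norms, the order flips because only the term
frequency is left to separate them.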

Cheers,

Ivan

Finally got your gist to load and noticed that you are using BM25. Never
used BM25 with elasticsearch/Lucene but my point about field length
normalization still remains. You can try changing the 'b' parameter. The
default value is 0.75. Posting your explain output will help.
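
As a rough illustration of what 'b' does, here is a sketch of the textbook
BM25 term-frequency component with the usual defaults k1 = 1.2 and b = 0.75.
It is not Lucene's exact implementation (which also works with encoded
norms), and the document lengths are invented for the example.

```python
def bm25_tf(freq, doc_len, avg_len, k1=1.2, b=0.75):
    """Term-frequency part of BM25; b controls length normalization."""
    length_factor = 1.0 - b + b * (doc_len / avg_len)
    return freq * (k1 + 1.0) / (freq + k1 * length_factor)

# Hypothetical documents: (occurrences of 'small', field length in terms)
docs = {"short": (1, 2), "long": (2, 16)}
avg_len = sum(length for _, length in docs.values()) / len(docs)

for b in (0.75, 0.0):
    scores = {name: bm25_tf(f, n, avg_len, b=b) for name, (f, n) in docs.items()}
    print(b, scores)
```

With b = 0.75 the short document wins despite having fewer occurrences. With
b = 0 the field length is ignored and the document with two occurrences
scores higher.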

I finally upgraded to 0.00, so I hope to play around with similarities soon.

Cheers,

Ivan

On Thursday, June 20, 2013 4:53:11 PM UTC+2, Ivan Brusic wrote:

In TF-IDF, the term frequency is used to weight which document is more
relevant, but the IDF is used to weigh which term in the query is more
relevant. For a single term query, IDF is irrelevant since it will be the
same for every document

Thanks for the reply. To simplify, here is a gist using the default TF-IDF,
https://gist.github.com/tomaspet262/5829590, with the results (explain=true)
at the bottom.

More specifically, look at the parts of the explain output for id=5 and id=7
(for the match in section). The excerpt is here:
https://picasaweb.google.com/tomas.pet262/21June2013#5891840719317352962 .
id=7 scores higher because its IDF is 1 (because in id=7 'small' occurs
only once). id=5 scores lower because its IDF is 0.306.

This is the opposite of what I want. I want id=5 to score higher (not the
other way around). How can I achieve that?
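
For reference, both IDF values quoted above can be reproduced with Lucene's
classic formula, idf = 1 + ln(numDocs / (docFreq + 1)), if the two documents
are scored against different statistics. The document counts below are
assumptions chosen to match the numbers, not values read from the explain
output. Note that IDF depends on how many documents contain the term, not on
how often the term occurs inside one document.

```python
import math

def classic_idf(doc_freq, num_docs):
    """Lucene classic similarity: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Assumed per-shard statistics (not taken from the explain output)
idf_small_shard = classic_idf(doc_freq=1, num_docs=1)
idf_larger_shard = classic_idf(doc_freq=1, num_docs=2)
print(idf_small_shard, idf_larger_shard)
```

The first value is roughly 0.3069 and the second is exactly 1.0, which would
be consistent with the two documents living on shards of different sizes.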

I do not know why the gist is not working for me either :'(. A mirror is at
http://pastebin.com/1HVAYSn0

Before trying to debug and fine-tune further, there are some other things
to look into.

First of all, TF-IDF breaks down somewhat with small data sets. Scores begin
to stabilize as the index grows. The problem is exacerbated in elasticsearch
because TF-IDF is computed per shard and then the non-normalized results
are aggregated from each shard. Try one of two things:

Perhaps you already have done one of the above, but just wanted to double
check before digging deeper.
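
The two suggestions themselves did not survive in this copy of the thread.
Judging by the follow-up below, one of them was probably to change the search
type. The dfs_query_then_fetch search type gathers term statistics from all
shards before scoring, so every shard uses the same IDF. The sketch below
only builds such a request without sending it. The host, the index name
'test' and the field name 'section' are assumptions.

```python
import json
from urllib.parse import urlencode

# Hypothetical host and index name; adjust to your cluster.
base_url = "http://localhost:9200/test/_search"
params = {"search_type": "dfs_query_then_fetch", "explain": "true"}
url = f"{base_url}?{urlencode(params)}"

# Field name 'section' is assumed from the explain output discussed earlier.
body = json.dumps({"query": {"match": {"section": "small"}}})
print(url)
print(body)
```

Another common way to get consistent statistics on a small test index, not
confirmed by this thread, is to create the index with a single shard.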

--
Ivan

On Friday, June 21, 2013 7:50:11 PM UTC+2, Ivan Brusic wrote:

I changed the search type and the results are now ordered exactly as I want.
Problem solved :-). Thank you so much!
