I have a gist, https://gist.github.com/anonymous/5813541 , where I search for
'small'. But the result where 'small' occurs just once ranks higher than the
result where 'small' occurs twice.
I want the results ordered so that the more frequently 'small' occurs, the
higher the result ranks. So, I want DF, not IDF.
Is there any way to disable IDF or override idf() in the Similarity class? Or
how else can I solve the problem?
Your gist is not loading for me right now (github down?), so I will try to
answer without all the information.
In TF-IDF, the term frequency is used to weigh which document is more
relevant, but the IDF is used to weigh which term in the query is more
relevant. For a single term query, IDF is irrelevant since it will be the
same for every document. You want to focus on the TF side of the algorithm.
First, you should have term frequencies enabled on the field. Second, the
field length is being normalized, so you might get some benefit from
omitting the norms.
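To illustrate the point about norms, here is a minimal Python sketch of a simplified single-term score in the style of Lucene's classic TF-IDF similarity. The formulas (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), norm = 1 / sqrt(field length)) follow the classic similarity as I understand it, but the function names and the two toy documents are made up for illustration, and query normalization, boosts and the lossy norm encoding are all ignored.

```python
import math

def classic_tf(freq):
    # Term frequency component: grows with the number of occurrences.
    return math.sqrt(freq)

def classic_idf(num_docs, doc_freq):
    # Same value for every document when the query has a single term.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def length_norm(num_terms):
    # Field length normalization: shorter fields get a larger factor.
    return 1.0 / math.sqrt(num_terms)

# Hypothetical toy documents, already tokenized.
DOCS = {
    "short_once": "small car".split(),
    "long_twice": "a small dog chased a small cat around the yard".split(),
}

def score(term, tokens, use_norms=True):
    freq = tokens.count(term)
    doc_freq = sum(1 for t in DOCS.values() if term in t)
    norm = length_norm(len(tokens)) if use_norms else 1.0
    return classic_tf(freq) * classic_idf(len(DOCS), doc_freq) * norm

def rank(term, use_norms=True):
    # Document names ordered from highest to lowest score.
    return sorted(DOCS, key=lambda name: score(term, DOCS[name], use_norms),
                  reverse=True)

print(rank("small", use_norms=True))
print(rank("small", use_norms=False))
```

With norms, the short document with a single occurrence comes first. With norms omitted, the document with two occurrences comes first, because only the TF part differs between the documents.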
Finally got your gist to load and noticed that you are using BM25. Never
used BM25 with elasticsearch/Lucene but my point about field length
normalization still remains. You can try changing the 'b' parameter. The
default value is 0.75. Posting your explain output will help.
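As a rough illustration of what 'b' does, here is a small Python sketch of the term-frequency part of the textbook BM25 formula. The function name, the document lengths and the average length are assumptions for the example, not values from the gist, and the IDF factor is left out because it is the same for every document in a single term query.

```python
def bm25_tf(freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    # b controls how strongly the score is normalized by field length:
    # b = 0 disables length normalization, b = 1 applies it fully.
    length_factor = 1.0 - b + b * (doc_len / avg_doc_len)
    return freq * (k1 + 1.0) / (freq + k1 * length_factor)

# Hypothetical documents: a 2-term field with one occurrence and
# a 10-term field with two occurrences; the average length is 6.
AVG_LEN = 6.0

for b in (0.75, 0.0):
    once_short = bm25_tf(freq=1, doc_len=2, avg_doc_len=AVG_LEN, b=b)
    twice_long = bm25_tf(freq=2, doc_len=10, avg_doc_len=AVG_LEN, b=b)
    print(b, round(once_short, 3), round(twice_long, 3))
```

In this toy case the single occurrence in the short field wins at b = 0.75, and the two occurrences in the long field win at b = 0. Whether the same holds for your data depends on the actual field lengths, so the explain output is still needed.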
I finally upgraded to 0.00, so I hope to play around with similarities soon.
Cheers,
Ivan
On Thursday, June 20, 2013 4:53:11 PM UTC+2, Ivan Brusic wrote:
In TF-IDF, the term frequency is used to weigh which document is more
relevant, but the IDF is used to weigh which term in the query is more
relevant. For a single term query, IDF is irrelevant since it will be the
same for every document.
More specifically, look at the parts of the explain output for id=5 and id=7
(in the match section). The excerpt is here: https://picasaweb.google.com/tomas.pet262/21June2013#5891840719317352962 .
id=7 scores higher because the IDF is 1 (because in id=7 'small' occurs
only once). id=5 scores lower because the IDF is 0.306.
This is the opposite of what I want. I want id=5 to score higher (not
the other way around). How can I achieve that?
Before trying to debug and fine-tune further, there are some other things
to look into.
First of all, TF-IDF breaks down somewhat with small data sets. Scores begin
to stabilize as the index grows. The problem is exacerbated in elasticsearch
because TF-IDF is computed per shard and then the non-normalized results
are aggregated from each shard. Try one of two things: