Can we manipulate the idf calculation field in Elasticsearch?


(ashit pupu) #1

Hi

I am facing an issue, and this may sound like a silly question, but I really need to know whether a solution is available.
I have two fields, title and statement. Title is a short field with only a few words, whereas statement is a multi-valued field that can hold any number of symptoms. The issue arises when we search for "beautiful evening" and both title and statement contain an exact match. Since both fields are given the same boost, I would expect both matches to get the same relevance score. However, because of idf, a doc with a match in title gets a higher relevance score than a doc with a match in statement.
I have tried copying both fields into a third field, but that failed because it also gave the same boost to a doc in which one word is in title and the other is in statement.
What I am looking for is some way to manipulate the idf calculation: when calculating the idf, it should refer to one particular field, into which we copy both fields, so that the idf is the same for both.
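
A minimal sketch of why the same term can score differently in the two fields, assuming the IDF formula used by Lucene's BM25 similarity. The document counts below are invented purely for illustration; the point is only that IDF is computed per field, so a term that is rare in title and common in statement gets two different values.

```python
import math


def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """IDF in the form used by Lucene's BM25 similarity.

    doc_count: number of docs that have a value for the field
    doc_freq:  number of docs whose field contains the term
    """
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical statistics, not taken from any real index.
docs_with_field = 1000
title_doc_freq = 5        # "evening" is rare in the short title field
statement_doc_freq = 200  # but common in the long statement field

title_idf = bm25_idf(docs_with_field, title_doc_freq)
statement_idf = bm25_idf(docs_with_field, statement_doc_freq)

print(f"title idf:     {title_idf:.3f}")
print(f"statement idf: {statement_idf:.3f}")
```

With equal boosts on both fields, the field in which the term is rarer ends up with the higher IDF and therefore the higher score.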

Any solution will help a lot.

Ashit


(Ryan Ernst) #2

Have you tried boosting the field you want to be more relevant?
https://www.elastic.co/guide/en/elasticsearch/guide/current/_boosting_query_clauses.html


(ashit pupu) #3

@rjernst yes, we are boosting the fields. We are using a multi_match query in which both fields are boosted with a boost factor of 4. As mentioned in the problem statement, even though both have the same boost factor of 4, when the same term matches in both fields the title field still gets a higher relevance score because its idf is higher. My requirement is that both should have the same relevance score.
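
For reference, the query described above would look roughly like the following request body. This is a sketch built as a Python dict; the field names and the boost of 4 come from the post, while everything else (the variable name, leaving all other multi_match options at their defaults) is an assumption.

```python
import json

# Request body for a multi_match query with the same boost on both fields.
# "title^4" is the standard field^boost notation.
query_body = {
    "query": {
        "multi_match": {
            "query": "beautiful evening",
            "fields": ["title^4", "statement^4"],
        }
    }
}

print(json.dumps(query_body, indent=2))
```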


(Ryan Ernst) #4

I think you should look at using BM25 similarity (which is the default starting in ES 5.0). It has parameters for controlling, e.g., the influence of field length normalization.
https://www.elastic.co/guide/en/elasticsearch/guide/current/pluggable-similarites.html


(ashit pupu) #5

Thanks @rjernst, but I am already using the BM25 similarity plugin, and field length and term frequency have been normalized. The only thing causing trouble in my case is the idf of the keywords searched for. Normalizing it is a big issue for which I have not found any answer.


(Ryan Ernst) #6

But you say you don't want field length normalization. Set b to 0.0 if you don't want the length of the field (i.e. title being shorter on average than statement) to affect the score.
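
A small sketch of what b controls, assuming the classic form of the BM25 term-frequency component (Lucene versions differ in a constant factor, which does not change the comparison). The field lengths are invented for illustration.

```python
def bm25_tf_component(freq: float, field_len: float, avg_field_len: float,
                      k1: float = 1.2, b: float = 0.75) -> float:
    """Term-frequency part of BM25 with length normalization controlled by b."""
    length_norm = 1 - b + b * field_len / avg_field_len
    return freq * (k1 + 1) / (freq + k1 * length_norm)


# Hypothetical lengths: a 3-term title and a 50-term statement, average 10.
short_default = bm25_tf_component(1, field_len=3, avg_field_len=10)
long_default = bm25_tf_component(1, field_len=50, avg_field_len=10)

short_no_norm = bm25_tf_component(1, field_len=3, avg_field_len=10, b=0.0)
long_no_norm = bm25_tf_component(1, field_len=50, avg_field_len=10, b=0.0)

print("default b:", short_default, long_default)  # short field scores higher
print("b = 0.0:  ", short_no_norm, long_no_norm)  # identical
```

With b = 0.0 the length term drops out, so a match in a short field and a match in a long field get the same term-frequency component.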


(ashit pupu) #7

Ahh! My bad. Both b and k have been set to 0 in the settings.
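
To see what that configuration leaves in the score, here is a rough sketch, assuming the classic BM25 formula and reading "k" as the k1 parameter. All index statistics are invented. With k1 = 0 and b = 0 the term-frequency component collapses to 1, so each field's score is just boost times IDF, and IDF is the only thing left that can differ between title and statement.

```python
import math


def bm25_score(freq, field_len, avg_len, doc_count, doc_freq,
               k1, b, boost=1.0):
    """Single-term BM25 score: boost * idf * tf component (classic form)."""
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1 - b + b * field_len / avg_len
    tf = freq * (k1 + 1) / (freq + k1 * length_norm)
    return boost * idf * tf


# Hypothetical statistics for the same term in the two fields.
title_score = bm25_score(freq=1, field_len=3, avg_len=4,
                         doc_count=1000, doc_freq=5,
                         k1=0.0, b=0.0, boost=4.0)
statement_score = bm25_score(freq=3, field_len=60, avg_len=40,
                             doc_count=1000, doc_freq=200,
                             k1=0.0, b=0.0, boost=4.0)

print("title:    ", title_score)
print("statement:", statement_score)
```

Term frequency and field length no longer matter here, yet the two scores still differ because the per-field document frequencies differ.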

