Disabling Elasticsearch Inverse Document Frequency scoring on ES relevance score

patlola · February 15, 2017, 7:08am

I am getting irrelevant results when running a simple match query on my documents.

My documents contain a lot of repeated words, but all of the words are important.

Documents are:
Doc_1) Nestle
Doc_2) Nestle Kitkat
Doc_3) nestle chocolate
Doc_4) nestle candy
Doc_5) nestle yoghurt
Doc_6) nestle fruit
... and hundreds of similar documents.

When I run a simple match query for "nestle", Doc_1 is not scored first: the term is repeated across so many documents that IDF treats it as not relevant.

I have tried disabling norms with norms: {"enabled": false} and setting index_options: "docs" in the field mapping, but I am still not getting relevant results.

{
  "query": {
    "bool": {
      "disable_coord": true,
      "should": [
        {
          "match": {
            "choclate.name": {
              "query": "Nestle",
              "operator": "and"
            }
          }
        },
        {
          "match": {
            "choclate.whitespace": {
              "query": "Nestle",
              "operator": "and"
            }
          }
        }
      ]
    }
  },
  "from": 0,
  "size": 1
}

Do I need to use a custom score function?
If I use a custom score function, I won't get the other relevance scoring features such as field length and normalization.

jpountz · February 15, 2017, 9:00am

Unfortunately, this is the kind of requirement that requires plugging in a custom similarity. This issue might do what you want: https://github.com/elastic/elasticsearch/issues/6731. It proposes adding a new similarity that does not take term frequency or document frequency into account, just the number of matching clauses, and document length if norms are enabled. You can upvote it if you think it would address your requirements.
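
To make the idea concrete, here is a minimal Python sketch of a scorer that counts matching clauses and ignores both term frequency and document frequency. It is not the implementation proposed in the issue and not a Lucene API: the whitespace `tokenize` helper, the `clause_count_score` name, and the `1 / sqrt(length)` normalization (modeled on Lucene's classic length norm) are all assumptions made for illustration.

```python
import math

def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real analyzer)."""
    return text.lower().split()

def clause_count_score(query, doc, use_norms=True):
    """Count the distinct query terms present in the document.

    Term frequency and document frequency are ignored entirely.
    """
    doc_terms = tokenize(doc)
    if not doc_terms:
        return 0.0
    matched = sum(1 for term in set(tokenize(query)) if term in doc_terms)
    # Assumed length normalization: shorter fields score higher.
    norm = 1.0 / math.sqrt(len(doc_terms)) if use_norms else 1.0
    return matched * norm

docs = ["Nestle", "Nestle Kitkat", "nestle chocolate", "nestle candy",
        "nestle yoghurt", "nestle fruit"]
ranked = sorted(docs, key=lambda d: clause_count_score("nestle", d), reverse=True)
print(ranked[0])
```

With norms enabled in this sketch, the one-word document is the only one with a score of 1.0, so it sorts first. With `use_norms=False`, every document containing the term ties.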

patlola · February 15, 2017, 9:04am

Thanks for your reply, @jpountz.
Instead of IDF, I would like to use DF only.

Is there any way to do that?

Sumit_Gupta · February 15, 2017, 9:05am

@jpountz Looks like both of these requirements are the same.

jpountz · February 15, 2017, 9:14am

Then you would indeed need a custom similarity. Similarities in Lucene are not designed to be extended, but it should be fairly easy to copy-paste an existing impl (typically bm25) and adapt it to your needs.
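
As a rough illustration of what "copy and adapt" could mean, the Python sketch below writes out a BM25-style term score with the IDF factor made pluggable, so it can be swapped for raw document frequency. This is only a model of the arithmetic, not Lucene code: the function names, the `term_weight` parameter, and the example numbers are hypothetical, and the formula is the classic BM25 form with the usual defaults `k1=1.2` and `b=0.75`.

```python
import math

def bm25_idf(doc_freq, doc_count):
    """IDF in the smoothed form commonly used with BM25."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def raw_df(doc_freq, doc_count):
    """Hypothetical replacement: weight a term by its document frequency."""
    return float(doc_freq)

def bm25_like(freq, doc_len, avg_len, doc_freq, doc_count,
              term_weight=bm25_idf, k1=1.2, b=0.75):
    """BM25 term score with the IDF factor replaced by `term_weight`."""
    tf_part = freq * (k1 + 1) / (freq + k1 * (1 - b + b * doc_len / avg_len))
    return term_weight(doc_freq, doc_count) * tf_part

# Made-up corpus statistics: a common term (900 of 1000 docs) and a rare one (5 of 1000).
common = dict(freq=1, doc_len=2, avg_len=2.0, doc_freq=900, doc_count=1000)
rare = dict(freq=1, doc_len=2, avg_len=2.0, doc_freq=5, doc_count=1000)

print(bm25_like(**common) < bm25_like(**rare))                                          # IDF favors the rare term
print(bm25_like(term_weight=raw_df, **common) > bm25_like(term_weight=raw_df, **rare))  # DF favors the common term
```

In a real plugin the same change would be made inside the copied similarity class; the sketch only shows which factor is being replaced.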

softwaredoug · February 15, 2017, 10:27pm

One solution is to set k1 to 0 in BM25, which, looking at the formula, I believe leaves you with just the BM25 IDF as the score. More here.
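
The claim can be checked against the formula with a few lines of Python. This is a sketch of the classic BM25 term score, not Elasticsearch code; the function name and the corpus numbers are made up for illustration, and the saturation parameter is called `k1` as in Lucene's naming.

```python
import math

def bm25_term_score(freq, doc_len, avg_len, doc_freq, doc_count, k1=1.2, b=0.75):
    """Classic BM25 score of one term in one document."""
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    tf_part = freq * (k1 + 1) / (freq + k1 * (1 - b + b * doc_len / avg_len))
    return idf * tf_part

# IDF alone for a term that occurs in 900 of 1000 documents (made-up numbers).
idf_only = math.log(1 + (1000 - 900 + 0.5) / (900 + 0.5))

# With k1 = 0 the term-frequency part collapses to freq / freq = 1,
# so frequency and document length no longer affect the score.
for freq, doc_len in [(1, 1), (1, 2), (3, 10)]:
    score = bm25_term_score(freq, doc_len, avg_len=2.0,
                            doc_freq=900, doc_count=1000, k1=0.0)
    print(freq, doc_len, math.isclose(score, idf_only))
```

One consequence worth noting: with `k1` at 0, every document that matches a single-term query gets the same score, because length normalization drops out along with term frequency. In Elasticsearch, `k1` and `b` are settings of the BM25 similarity and can be configured per index.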

sdauletau · February 16, 2017, 1:25am

I have a plugin that ignores tf-idf.

You can use it as an example and update the tf and/or idf methods to return the term or document frequency.

github.com

sdauletau/elasticsearch-simple-similarity/blob/master/src/main/java/org/elasticsearch/index/similarity/SimpleSimilarity.java#L72


            scores.add(Explanation.match(1.0f, description));
            total += 1.0f;
        }
        this.score = Explanation.match(total, "total score, sum of:", scores);
        this.boost = Explanation.match(boost, "boost");
    }
}


public final SimScorer simScorer(SimWeight weight, LeafReaderContext context) throws IOException {
    SimpleScore score = (SimpleScore) weight;
    return new SimpleScorer(score);
}


private final class SimpleScorer extends SimScorer {
    private final SimpleScore score;


    SimpleScorer(SimpleScore score) throws IOException {
        this.score = score;
    }


    public float score(int doc, float freq) {

system · March 16, 2017, 1:25am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
