How to increase relevancy for duplicate documents?


#1

TF-IDF is causing unwanted behavior for my query. I have a set of documents in the following format:

{
    "name": "brown fox",
    "description": "a sentence-long description"
}

Many documents have the same name with different descriptions. If I search for something like brown fox, I want to receive all documents with the name brown fox, because that is an exact match (or close to one).

Instead, the top hit is:

{
    "name": "yellow dog",
    "description": "the dog is not brown"
}

This document is the only one with brown in its description, so the TF-IDF score for that match is high. Meanwhile, both brown and fox match the name of the brown fox documents, but the TF-IDF score there is low because so many documents share that name.
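To illustrate what I think is happening, here is a rough sketch of the IDF part of the score. It assumes the default BM25 similarity and its IDF formula; the document counts are made up for illustration and are not taken from my real index.

```python
import math


def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """IDF component of BM25 as computed by Lucene's BM25 similarity."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical index statistics, chosen only for illustration.
total_docs = 1000
name_docs_with_brown = 200        # many documents share the name "brown fox"
description_docs_with_brown = 1   # only the "yellow dog" document

idf_name = bm25_idf(name_docs_with_brown, total_docs)
idf_description = bm25_idf(description_docs_with_brown, total_docs)

print(f"idf(name:brown)        = {idf_name:.2f}")
print(f"idf(description:brown) = {idf_description:.2f}")
```

With these made-up numbers, the IDF for brown in description comes out several times larger than the IDF for brown in name, which matches the ranking I am seeing.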

Any tips on how to increase the score of the brown fox documents?


Mapping: both fields are of type text and use the standard analyzer.

Query:

  dis_max:
      tie_breaker: 0.7
      queries:
          - match:
              name: "{{search_string}}"
          - match:
              description: "{{search_string}}"

I don't want to disable TF-IDF on name because it helps in other search cases. Is it possible to group name and description together for the TF-IDF calculation, so that a brown in description is not weighted higher than a brown in name? Or is it possible to stop the duplicate document names from increasing docFreq?
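One direction that might correspond to the "group the fields together" idea is a multi_match query of type cross_fields, which treats the listed fields as one combined field and blends their term statistics. The sketch below only builds the request body as a Python dict; the helper name build_cross_fields_query is hypothetical, the tie_breaker value is just carried over from my dis_max query, and I have not verified that this actually fixes the ranking.

```python
import json


def build_cross_fields_query(search_string: str) -> dict:
    """Build a search body that scores name and description as one field."""
    return {
        "query": {
            "multi_match": {
                "query": search_string,
                # cross_fields blends term statistics across the listed
                # fields; it expects them to share the same analyzer.
                "type": "cross_fields",
                "fields": ["name", "description"],
                # Assumption: same tie_breaker as in the dis_max query.
                "tie_breaker": 0.7,
            }
        }
    }


body = build_cross_fields_query("brown fox")
print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of a _search request. Both fields use the standard analyzer here, so they should fall into the same cross_fields group.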

Thank you for any help! Let me know if I left anything out.

