TF-IDF is causing unwanted behavior for my query. I have a set of documents in the following format:
{
"name": "brown fox",
"description": "a sentence-long description"
}
Many documents have the same name with different descriptions. If I search for something like brown fox, I want to receive all documents with the name brown fox, because it is an exact match (or close to one).
Instead, the top hit is:
{
"name": "yellow dog",
"description": "the dog is not brown"
}
This document is the only one with brown in its description, so the TF-IDF score for that match is high. Meanwhile, both brown and fox match the brown fox documents, but their TF-IDF score is low because of the duplicates.
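To illustrate the effect, here is a rough sketch with made-up numbers. The corpus size and document frequencies are assumptions, not my real index statistics, and the formula is the BM25-style idf used by Lucene's BM25Similarity, which I am assuming is the similarity in play. Each field keeps its own statistics, so the same term can get a very different idf in name than in description:

```python
import math


def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """BM25-style idf: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical statistics: 1000 documents, 200 of them named "brown fox",
# and only one description that contains "brown".
DOC_COUNT = 1000
idf_brown_in_name = bm25_idf(DOC_COUNT, doc_freq=200)
idf_brown_in_description = bm25_idf(DOC_COUNT, doc_freq=1)

print(f"idf(brown) in name:        {idf_brown_in_name:.2f}")
print(f"idf(brown) in description: {idf_brown_in_description:.2f}")
```

With these assumed numbers, the single rare match in description gets a much larger idf than the common match in name, which is the behavior I am seeing.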
Any tips on how to increase the score of the brown fox documents?
Mapping: both fields are of type text and use the standard analyzer.
Query:
dis_max:
  tie_breaker: 0.7
  queries:
    - match:
        name: "{{search_string}}"
    - match:
        description: "{{search_string}}"
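For reference, once the template above is rendered, the request body should look roughly like what this Python sketch produces. build_query is a hypothetical helper name, and wrapping the clause in a top-level "query" key is my assumption about how the body is sent to the search endpoint:

```python
import json


def build_query(search_string: str, tie_breaker: float = 0.7) -> dict:
    """Build a dis_max request body equivalent to the template above."""
    return {
        "query": {
            "dis_max": {
                "tie_breaker": tie_breaker,
                "queries": [
                    # One match clause per field, same search string in both.
                    {"match": {"name": search_string}},
                    {"match": {"description": search_string}},
                ],
            }
        }
    }


print(json.dumps(build_query("brown fox"), indent=2))
```

With dis_max, a document's score is the best-scoring clause plus tie_breaker times the scores of the other matching clauses, so a single very high description score can still outrank a document that matches both terms in name.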
I don't want to disable TF-IDF on name because it helps in other search cases. Is it possible to either group together name and description for the TF-IDF calculation, so that a brown in description is not weighted higher than one in name? Or is it possible to stop the duplicate document names from increasing docFreq?
Thank you for any help! Let me know if I left anything out.