What's the best approach to balance text similarity with different field weights?

Say we have two docs:

{"_id": 1, "title": "James Harden wins the MVP", "content": "xxxxxxxxxxxxxxxxxxxxxx"}

{"_id": 2, "title": "The new 007 movie comes!", "content": "xxxx James Bond xxxxxxxxxxxxx"}

When a user searches for James Bond, we might construct an ES query like this:

GET docs/_search
{
  "query": {
    "multi_match": {
      "query": "james bond",
      "fields": ["title^3", "content"]
    }
  }
}

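Because of the heavy boost on the title field, doc 1 (which only matches "James" in its title) may score higher than doc 2 (which contains the full phrase "James Bond" in its content).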

So my question is: what is the best approach to make sure doc 2 scores higher than doc 1?

Thanks for the help!

Generally, if you blend strict and sloppier interpretations of a user query, the docs that match best (strict AND sloppy) will rank higher.

In declining order of strictness:

  1. Phrase query (all terms must match and be next to each other in the text)
  2. AND query (all terms must appear somewhere in the text)
  3. OR query (at least one term must match)
  4. Fuzzy query (at least one vaguely reminiscent term must match)

These can all be combined as should clauses of a single bool query.
The more clauses a document matches, the higher its score. The downside is that the query is more costly to run.
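
As a rough sketch of that idea for the two docs in the question, the request body could be built like this in Python. The helper name blended_query is made up for illustration, the field boosts are copied from the original query, and the choice of multi_match options for each strictness level is my assumption, not the only way to do it.

```python
import json


def blended_query(text, fields):
    """Build a bool query whose should clauses run from strict to sloppy.

    Hypothetical helper for illustration; it only builds the request body.
    """
    def clause(**options):
        # Every clause searches the same text in the same (boosted) fields.
        return {"multi_match": {"query": text, "fields": fields, **options}}

    return {
        "query": {
            "bool": {
                "should": [
                    clause(type="phrase"),     # 1. terms adjacent and in order
                    clause(operator="and"),    # 2. all terms present in one field
                    clause(),                  # 3. at least one term (default OR)
                    clause(fuzziness="AUTO"),  # 4. at least one near-miss term
                ]
            }
        }
    }


body = blended_query("james bond", ["title^3", "content"])
print(json.dumps(body, indent=2))  # body to send with GET docs/_search
```

With this query, doc 2 should match all four clauses through its content field, while doc 1 only matches the OR and fuzzy clauses through its title. That would be expected to push doc 2 ahead, though the final ranking still depends on the index statistics and on how large the title boost is.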

I wrote an example of this in the following gist:

Thanks for the help!
