How does span near scoring work?

Hi, I'm facing an issue with the keyword search I'm building: some documents rack up a huge score out of nowhere and overtake documents that seem more relevant. For example, if the query is "quick brown fox", huge wordy documents in which the only somewhat relevant word appears to be "for" (matched through fuzzy matching) manage to overtake documents that contain exact matches for all or part of "quick brown fox".

I suspected the cause was the repeated "for" matching against the query. However, when I tested this by creating a document with many repetitions of the word "fox", it did not overtake the documents containing "quick brown fox".

I'd appreciate any insights into how the scoring works for span near queries or guidance on how to prevent documents with weak matches from overpowering higher-quality matches.

I'm using a bool query over multiple span_near queries (with span_term) to accommodate partial matches. The query is split into segments to support documents that may not contain all the words in the original phrase. Each segment combination is wrapped in a span_near and passed into the should clause of the bool query. This is what the search query looks like when the query is "quick brown fox".

GET /connector-test-8.18-connector-new/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "span_near": {
            "clauses": [
              { "span_multi": { "match": { "fuzzy": { "body": { "value": "quick", "fuzziness": "auto" } } } } }
            ],
            "slop": 1,
            "in_order": true,
            "boost": 1
          }
        },
        {
          "span_near": {
            "clauses": [
              { "span_multi": { "match": { "fuzzy": { "body": { "value": "quick", "fuzziness": "auto" } } } } },
              { "span_multi": { "match": { "fuzzy": { "body": { "value": "brown", "fuzziness": "auto" } } } } }
            ],
            "slop": 2,
            "in_order": true,
            "boost": 2
          }
        },
        {
          "span_near": {
            "clauses": [
              { "span_multi": { "match": { "fuzzy": { "body": { "value": "quick", "fuzziness": "auto" } } } } },
              { "span_multi": { "match": { "fuzzy": { "body": { "value": "brown", "fuzziness": "auto" } } } } },
              { "span_multi": { "match": { "fuzzy": { "body": { "value": "fox", "fuzziness": "auto" } } } } }
            ],
            "slop": 3,
            "in_order": true,
            "boost": 3
          }
        },
        {
          "span_near": {
            "clauses": [
              { "span_multi": { "match": { "fuzzy": { "body": { "value": "fox", "fuzziness": "auto" } } } } }
            ],
            "slop": 1,
            "in_order": true,
            "boost": 1
          }
        },
        {
          "span_near": {
            "clauses": [
              { "span_multi": { "match": { "fuzzy": { "body": { "value": "brown", "fuzziness": "auto" } } } } }
            ],
            "slop": 1,
            "in_order": true,
            "boost": 1
          }
        },
        {
          "span_near": {
            "clauses": [
              { "span_multi": { "match": { "fuzzy": { "body": { "value": "brown", "fuzziness": "auto" } } } } },
              { "span_multi": { "match": { "fuzzy": { "body": { "value": "fox", "fuzziness": "auto" } } } } }
            ],
            "slop": 2,
            "in_order": true,
            "boost": 2
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}

Hey there @symphony and welcome to the community!

You may be interested in checking out the dis_max query. I think it will meet your needs for smoothing relevance scores.
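
As a rough sketch of what that could look like: a bool query adds up the scores of every matching should clause, whereas dis_max takes the score of the single best-matching clause and adds tie_breaker times the scores of the other matching clauses. The request below reuses the index, field, slop, and boost values from the question and shows only three of the six segments for brevity. The tie_breaker value of 0.0 is an assumption to tune, and whether this fully resolves the ranking problem has not been verified against your data.

```json
# Sketch only: the original should clauses moved into dis_max; tie_breaker is an assumed starting value
GET /connector-test-8.18-connector-new/_search
{
  "query": {
    "dis_max": {
      "queries": [
        {
          "span_near": {
            "clauses": [
              { "span_multi": { "match": { "fuzzy": { "body": { "value": "quick", "fuzziness": "auto" } } } } },
              { "span_multi": { "match": { "fuzzy": { "body": { "value": "brown", "fuzziness": "auto" } } } } },
              { "span_multi": { "match": { "fuzzy": { "body": { "value": "fox", "fuzziness": "auto" } } } } }
            ],
            "slop": 3,
            "in_order": true,
            "boost": 3
          }
        },
        {
          "span_near": {
            "clauses": [
              { "span_multi": { "match": { "fuzzy": { "body": { "value": "quick", "fuzziness": "auto" } } } } },
              { "span_multi": { "match": { "fuzzy": { "body": { "value": "brown", "fuzziness": "auto" } } } } }
            ],
            "slop": 2,
            "in_order": true,
            "boost": 2
          }
        },
        {
          "span_near": {
            "clauses": [
              { "span_multi": { "match": { "fuzzy": { "body": { "value": "fox", "fuzziness": "auto" } } } } }
            ],
            "slop": 1,
            "in_order": true,
            "boost": 1
          }
        }
      ],
      "tie_breaker": 0.0
    }
  }
}
```

With tie_breaker at 0.0, a document is scored by its single best segment only; raising it toward 1.0 moves the behavior back toward the summed scoring of the bool query.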