I need help writing a simple custom scoring function

I want to implement my own scoring algorithm for a use case that requires finding one distinct document from a query result set. This requires access to two key pieces of information:

  1. The number of query terms as analyzed by ES
  2. The number of analyzed terms in the document's field

I do not need access to the inverted index, and I don't care what the terms are during scoring. All that's needed is the two term counts stated above.

Consider a simple example set with the following analyzed terms:

doc1 = ["pizza"]
doc2 = ["cheese", "pizza"]
doc3 = ["large", "cheese", "pizza"]
doc4 = ["small", "cheese", "pizza"]

My desired scoring function could not be simpler once documents match full-text terms with an AND operation:

score = # analyzed query terms / # analyzed field terms

Consider these queries to find distinct matches above:

  1. I want to find doc1. The distinct query is "pizza" with the highest score of 1.

  2. I want to find doc2. The distinct query is "cheese pizza" with the highest score of 2/2 = 1 (doc3 and doc4 also match, but score only 2/3).

  3. I want to find doc3. The distinct query is "large", "large pizza", "large cheese" or "large cheese pizza" with the highest scores of 1/3, 2/3, 2/3 and 3/3, respectively.

  4. I want to find doc4. The distinct query is "small", "small cheese", "small pizza" or "small cheese pizza" with the highest scores of 1/3, 2/3, 2/3, and 3/3, respectively.
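
To make the examples above concrete, here is a minimal Python sketch of the desired scoring outside of ES. It assumes a hypothetical `analyze` helper that only lowercases and splits on whitespace (no stemming, unlike the real analyzer), and the `DOCS` table is just the example set copied from above. It is meant to illustrate the arithmetic, not how ES would compute it.

```python
from fractions import Fraction

# Analyzed terms per document, copied from the example set above.
DOCS = {
    "doc1": ["pizza"],
    "doc2": ["cheese", "pizza"],
    "doc3": ["large", "cheese", "pizza"],
    "doc4": ["small", "cheese", "pizza"],
}

def analyze(text):
    # Hypothetical stand-in for the ES analyzer: lowercase, split on whitespace.
    return text.lower().split()

def score_docs(query):
    """Return {doc: score} for documents that contain every query term (AND)."""
    query_terms = analyze(query)
    scores = {}
    for name, field_terms in DOCS.items():
        if all(term in field_terms for term in query_terms):
            # score = # analyzed query terms / # analyzed field terms
            scores[name] = Fraction(len(query_terms), len(field_terms))
    return scores

def best_match(query):
    """Return the highest-scoring document name, or None if nothing matches."""
    scores = score_docs(query)
    return max(scores, key=scores.get) if scores else None

for q in ["pizza", "cheese pizza", "large", "small cheese pizza"]:
    print(q, "->", best_match(q), {d: str(s) for d, s in score_docs(q).items()})
```

With this model, "cheese pizza" matches doc2, doc3 and doc4, but only doc2 reaches a score of 1, which is the behavior I'm after.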

I'm a novice with Painless scripting. Can I access the two counts above per document to perform my simple scoring algorithm? And how do I write this scoring function?

Below are sample mappings, documents, and a desired query, waiting for my custom scoring function.

Thanks!

PUT food
{
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "english"
        }
      }
    }
  }
}
PUT food/_doc/1
{
  "name": "pizza"
}
PUT food/_doc/2
{
  "name": "cheese pizza"
}
PUT food/_doc/3
{
  "name": "large cheese pizza"
}
PUT food/_doc/4
{
  "name": "small cheese pizza"
}

GET food/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": {
              "query": "cheese pizza",
              "operator": "and",
              "fuzziness": "AUTO",
              "prefix_length": 2
            }
          }
        }
      ]
    }
  }
}

UPDATE: I found it's possible to index a sub-field containing the number of analyzed terms (below). But how do I compute, inside a Painless script, the number of terms in a query string passed as a param, analyzed as full text with the same analyzer?

PUT food
{
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "english",
          "store": true,
          "fields": {
            "length": {
              "type": "token_count",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}

Then this query returns that field:

GET food/_search
{
  "_source": "*",
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": {
              "query": "cheese pizza",
              "operator": "and",
              "fuzziness": "AUTO",
              "prefix_length": 2
            }
          }
        }
      ]
    }
  },
  "script_fields": {
    "name_term_count": {
      "script": {
        "lang": "painless",
        "source": "doc['name.length']"
      }
    }
  }
}

Example hit:

  {
    "_index": "food",
    "_type": "_doc",
    "_id": "3",
    "_score": 0.5753642,
    "_source": {
      "name": "large cheese pizza"
    },
    "fields": {
      "name_term_count": [
        3
      ]
    }
  }

Of course I can run this first, but it incurs an extra round-trip network call just to obtain the query term count:

GET food/_analyze
{
  "field": "name",
  "text": "large cheese pizza"
}

result:

{ tokens:
   [ { token: 'larg',
       start_offset: 0,
       end_offset: 5,
       type: '<ALPHANUM>',
       position: 0 },
     { token: 'chees',
       start_offset: 6,
       end_offset: 12,
       type: '<ALPHANUM>',
       position: 1 },
     { token: 'pizza',
       start_offset: 13,
       end_offset: 18,
       type: '<ALPHANUM>',
       position: 2 } ] }
terms: 3
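
For reference, this is roughly how I imagine the count from `_analyze` being fed into the search, sketched in Python. It still needs the extra round trip. The helper names `count_tokens` and `build_scored_query` are made up for illustration; the request body assumes a `function_score` query with `script_score`, a `query_terms` param of my own naming, and the `name.length` token_count sub-field from the mapping above. I have not verified this against a cluster.

```python
import json

def count_tokens(analyze_response):
    # An _analyze response carries one entry per emitted token.
    return len(analyze_response["tokens"])

def build_scored_query(query_text, query_term_count):
    """Wrap the AND match query so the score becomes query terms / field terms."""
    return {
        "query": {
            "function_score": {
                "query": {
                    "match": {
                        "name": {
                            "query": query_text,
                            "operator": "and",
                            "fuzziness": "AUTO",
                            "prefix_length": 2,
                        }
                    }
                },
                "script_score": {
                    "script": {
                        "lang": "painless",
                        # Multiply by 1.0 to force floating-point division.
                        "source": "params.query_terms * 1.0 / doc['name.length'].value",
                        # query_terms is an assumed param name, not an ES built-in.
                        "params": {"query_terms": query_term_count},
                    }
                },
                # Use the script result as the final score instead of multiplying.
                "boost_mode": "replace",
            }
        }
    }

# Token list from the _analyze result above, trimmed to the token text.
analyze_response = {"tokens": [{"token": "larg"}, {"token": "chees"}, {"token": "pizza"}]}
body = build_scored_query("large cheese pizza", count_tokens(analyze_response))
print(json.dumps(body, indent=2))
```

The printed body would then be sent to `GET food/_search`. What I'd really like is a way to get the query term count inside the script itself, so the separate `_analyze` call isn't needed.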
