Vector functions are not available in Sort Context? [painless] [painful]

Dear team,

I am trying to use dotProduct() in a script sort and failing. The simplified case looks like this:

Setting up the data

PUT test_index
{
  "mappings": {
    "properties": {
      "v": {
        "type": "dense_vector",
        "dims" : 3
      }
    }
  }
}

POST test_index/_bulk
{ "index" : { "_id" : "1" } }
{"v" : [10, 10, 10]}
{ "index" : { "_id" : "2" } }
{"v" : [10, 20, 30]}

Searching:

POST test_index/_search
{
  "query" : {
    "match_all": {}
  },
  "sort": [
    {
      "_script": {
        "script": {
          "source": "def xt = params.filterVector; return dotProduct(xt,'v')",
          "params": {
            "filterVector": [10, 10, 10]
          }
        },
        "type": "number"
      }
    }
  ]
}

This fails with the following:

{
  "error": {
    "root_cause": [
      {
        "type": "script_exception",
        "reason": "compile error",
        "script_stack": [
          "... ams.filterVector; return dotProduct(xt,'v')",
          "                             ^---- HERE"
        ],
        "script": "def xt = params.filterVector; return dotProduct(xt,'v')",
        "lang": "painless",
        "position": {
          "offset": 37,
          "start": 12,
          "end": 55
        }
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "test_index",
        "node": "ZbZZiYwFQ1G1mdYgHlSl2w",
        "reason": {
          "type": "script_exception",
          "reason": "compile error",
          "script_stack": [
            "... ams.filterVector; return dotProduct(xt,'v')",
            "                             ^---- HERE"
          ],
          "script": "def xt = params.filterVector; return dotProduct(xt,'v')",
          "lang": "painless",
          "position": {
            "offset": 37,
            "start": 12,
            "end": 55
          },
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "Unknown call [dotProduct] with [2] arguments."
          }
        }
      }
    ]
  },
  "status": 400
}

Now, it will work if I rewrite the query like this:

POST test_index/_search
{
  "query": {
    "script_score" : {
      "query" : {"match_all" : {}},
      "script": {
        "source": """
            def xt = params.filterVector;
            return dotProduct(xt, 'v');
        """, 
        "params": {
          "filterVector": [10, 10, 10] 
        }
      }
    }
  }
}

This thread made me think that what I am trying to do was already implemented. However, I see that the related PR was never merged; the response was: "We have redesigned vector functions to expect ScoreScript parameter."

Can you confirm that what I am trying to do is indeed not possible?

@MarynaCherniavska

You are correct, it is not possible. You must use the vector functions for scoring, not sorting.

Could you explain in more detail why scoring this way doesn't work? Why is sorting mandatory?

@BenTrent thanks for the quick response! I am not yet sure whether it works or not. The idea is to return the top X elements from the list, ranked by a scoring function that is in turn based on the dot product of the vectors.

The script_score query is described as follows:

Uses a script to provide a custom score for returned documents.

The script_score query is useful if, for example, a scoring function is expensive and you only need to calculate the score of a filtered set of documents.

This makes me think that the order of actions here would be to "select top 20, then sort by score", rather than "sort by score, then select top 20", which might lead to different results, right? Or is this one and the same thing?

Hey @MarynaCherniavska, you can score a subset or all documents, and the results will be sorted by _score. You can even rescore previously scored documents per shard; see "Filter search results" in the Elasticsearch Guide [7.17].
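
To make the "sort by score, then select top 20" reading concrete, here is a minimal sketch that reuses the script_score query from earlier in the thread and only adds "size". The value 20 comes from the question above; it is an illustration, not a recommendation. Because the inner query is match_all, every document is scored before the top hits are picked.

```console
POST test_index/_search
{
  "size": 20,
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "def xt = params.filterVector; return dotProduct(xt, 'v')",
        "params": {
          "filterVector": [10, 10, 10]
        }
      }
    }
  }
}
```

With the two sample documents, I would expect document 2 (dot product 600) to come back ahead of document 1 (dot product 300), since hits are returned in descending _score order by default.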

How many documents are we talking here? If you have 100s of thousands, you may want to use approximate nearest neighbors (knn search).

Or, if you are simply rescoring a small subset (lower end of 10s of thousands), script score could be performant enough.
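
As a rough sketch of the "small subset" case: the inner query of script_score decides which documents get scored at all, so a filter there limits how many dot products are computed. The field "category" and the value "books" are hypothetical; they are not part of the mapping shown earlier and are only there to illustrate the shape of the request.

```console
POST test_index/_search
{
  "size": 20,
  "query": {
    "script_score": {
      "query": {
        "bool": {
          "filter": [
            { "term": { "category": "books" } }
          ]
        }
      },
      "script": {
        "source": "def xt = params.filterVector; return dotProduct(xt, 'v')",
        "params": {
          "filterVector": [10, 10, 10]
        }
      }
    }
  }
}
```

Only documents matching the filter are passed to the script, so the cost scales with the size of the filtered set rather than the whole index.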

The average count on the indexes in question is ~500K records. What would you advise for such a case?

If you are querying all 500k (and there is no filter or query applied to reduce the number), and they are all on the same shard, I would recommend indexing them for KNN search.
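
For reference, a minimal sketch of what indexing for kNN search might look like. This assumes Elasticsearch 8.4 or later, where the "knn" option is available on _search; it does not apply to 7.17. The index name test_index_knn, and the values of "k" and "num_candidates", are made up for illustration. I picked "cosine" as the similarity because, as far as I know, the "dot_product" similarity expects unit-length vectors, which the sample vectors in this thread are not.

```console
PUT test_index_knn
{
  "mappings": {
    "properties": {
      "v": {
        "type": "dense_vector",
        "dims": 3,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}

POST test_index_knn/_search
{
  "knn": {
    "field": "v",
    "query_vector": [10, 10, 10],
    "k": 20,
    "num_candidates": 100
  }
}
```

Note that cosine ranking is not the same as raw dot product ranking unless the vectors are normalized: for the sample data, cosine would favor [10, 10, 10] while the dot product favors [10, 20, 30]. So this would need to be checked against your relevance requirements.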

Exact search works great at smaller scales (only scoring 10k documents for vectors). Once you need to find the nearest out of 100s of thousands or millions+, it becomes computationally expensive and slow.

But the only way to know is to test relevancy and performance on your own data and determine what is acceptable :)
