Combine random sampling with vector similarity scoring

I have an index with a dense_vector field on which I am running text similarity search. (For reference, I followed the sample here: https://www.elastic.co/blog/text-similarity-search-with-vectors-in-elasticsearch)

The documentation mentions that vector functions are applied linearly to all documents matching a query, and that a filter should be applied to restrict the number of documents that are scanned linearly. In my use case I would like to filter by date range; however, my index is so large that a single day could match millions of docs. If a user wanted to query for a week or a month, it could match hundreds of millions of documents, clearly not something I would want to scan linearly.

Ideally I would like to randomly sample N documents that match my date range and then pass that limited set to the linear time vector function. Something like this:

  "query": {
    "script_score": {
      "query": {
        "function_score": {
          "query": {
            ...date range filter here...
          },
          "random_score": {},
          "boost_mode": "replace",
          "max_results": 10000 <----- IS SOMETHING LIKE THIS POSSIBLE?
        }
      },
      "script": {
        "source": "cosineSimilarity(params.query_vector, doc['my_dense_vector']) + 1.0",
        "params": {
          "query_vector": [0.1, 0.2, ...]
        }
      }
    }
  }

To be clear, I am not asking how to use the search "size" parameter; rather, I want to limit the inner query that passes its results to the script_score.

Thanks!

If you are willing to drop your requirement about a random sampling, you can do the following things:

An Elasticsearch search request has a terminate_after parameter: the maximum number of documents to collect on each shard; once that number is reached, query execution terminates early. But this doesn't give you a random sample, because collection always starts with the documents with lower internal IDs and progresses to documents with higher internal IDs; if your index doesn't change, you will always get the same documents.
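As a rough sketch of the terminate_after option, the request body could be built like this in Python. The helper name build_terminate_after_body, the field names timestamp and my_dense_vector, the limit of 10000 and the example vector are all assumptions for illustration, and the script uses the 7.6-style syntax where the vector field is passed by name:

```python
import json


def build_terminate_after_body(query_vector, per_shard_limit=10000):
    """Build a search body that scores at most per_shard_limit docs per shard.

    Hypothetical helper: field names and defaults are placeholders.
    """
    return {
        # Each shard stops collecting after this many matching documents.
        "terminate_after": per_shard_limit,
        "query": {
            "script_score": {
                # Cheap date-range filter selects the candidate documents.
                "query": {
                    "range": {"timestamp": {"gte": "now-1d/d", "lt": "now/d"}}
                },
                # Cosine similarity is computed only for collected documents.
                "script": {
                    "source": "cosineSimilarity(params.query_vector, 'my_dense_vector') + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        },
    }


body = build_terminate_after_body([4, 3.4, -0.2])
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a _search request. Because collection order follows internal document IDs, repeated runs against an unchanged index would score the same documents each time.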

Another way to do this is to put cosine_similarity in rescoring:

{
  "query": {
    "range": {
      "timestamp": {
        "gte": "now-1d/d",
        "lt": "now/d"
      }
    }
  },
  "rescore": {
    "window_size": 1000,
    "query": {
      "rescore_query": {
        "script": {
          "source": "cosineSimilarity(params.query_vector, 'my_dense_vector') + 1.0",
          "params": {
            "query_vector": [4, 3.4, -0.2]
          }
        }
      },
      "query_weight": 0,
      "rescore_query_weight": 1
    }
  }
}

The range filter executes very quickly, and the expensive cosine similarity calculation is applied only to the top 1000 docs on each shard (the window_size). There is no random sampling here either; you will always get the same 1000 docs.


The only way to get a random sample that I am aware of is indeed to apply the random_score function. To get a random sample you will need to apply this function to all documents (or all documents selected by a filter). The good news is that this function is quite fast, so applying it to millions of documents should not be a problem. So what you can do is use your function_score query with the random_score function, and then rescore 1000 docs with the more expensive cosine similarity function.

Hi @mayya, thank you for the suggestion! However, when trying it, I get this error:

"type" : "parsing_exception",
"reason" : "[script] query does not support [source]"

I am on Elasticsearch 7.6.2, and here is my query:

GET /my_index/_search?size=50
{
  "_source": [ ... ],
  "query": {
    "match_all": {}
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "script": {
          "source": "cosineSimilarity(params.query_vector, 'my_dense_vector') + 1.0",
          "params": {
            "query_vector": [0.1, 0.2, ...]
          }
        }
      },
      "query_weight": 0,
      "rescore_query_weight": 1
    }
  }
}

Thanks again for your assistance!

Sorry, I made a mistake omitting several lines. The rescore_query part needs a query, so it should be something like this:

"rescore_query" : {
    "script_score": {
        "query": {"match_all": {}},
        "script": {
            "source": "cosineSimilarity(params.query_vector, 'my_dense_vector') + 1.0",
            "params": {
                "query_vector": <query_vector>
            }
        }
    }
}
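
Putting this together with the random_score suggestion earlier in the thread, a complete request body might look like the following sketch, built as a Python dict. The helper name build_sampled_rescore_body, the timestamp and my_dense_vector field names, the date range and the example vector are assumptions for illustration, not something confirmed against a real index:

```python
import json


def build_sampled_rescore_body(query_vector, window_size=1000):
    """Random-sample with random_score, then rescore the sample by cosine similarity.

    Hypothetical helper: field names and defaults are placeholders.
    """
    return {
        "query": {
            "function_score": {
                # Cheap date-range filter selects the candidate documents.
                "query": {
                    "range": {"timestamp": {"gte": "now-1d/d", "lt": "now/d"}}
                },
                # A random score per document makes the top hits a random sample.
                "random_score": {},
                # Use the random value as the score, ignoring the filter's score.
                "boost_mode": "replace",
            }
        },
        "rescore": {
            # Number of top (here: random) docs rescored on each shard.
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "script_score": {
                        "query": {"match_all": {}},
                        "script": {
                            "source": "cosineSimilarity(params.query_vector, 'my_dense_vector') + 1.0",
                            "params": {"query_vector": query_vector},
                        },
                    }
                },
                # Drop the random score; keep only the similarity score.
                "query_weight": 0,
                "rescore_query_weight": 1,
            },
        },
    }


body = build_sampled_rescore_body([4, 3.4, -0.2])
print(json.dumps(body, indent=2))
```

Since window_size applies per shard, the sample is drawn independently on each shard. It is probably simplest to keep the request's size at or below window_size so that every returned hit has actually been rescored.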
