Combine random sampling with vector similarity scoring

Abraham_Sanders · April 15, 2020, 7:23am

I have an index with a dense_vector field on which I am doing text similarity search. (For reference, I followed the sample here: https://www.elastic.co/blog/text-similarity-search-with-vectors-in-elasticsearch)

The documentation mentions that vector functions are applied linearly to all documents matching a query, and that a filter should be applied to restrict the number of documents that are scanned. In my use case I would like to filter by date range; however, my index is so large that I could get millions of docs matching just one day. If a user wanted to query for a week or a month, it could match hundreds of millions of documents, which is clearly not something I would want to scan linearly.

Ideally I would like to randomly sample N documents that match my date range and then pass that limited set to the linear time vector function. Something like this:

  "query": {
    "script_score": {
      "query": {
        "function_score": {
          "query": {
            ...date range filter here...
          },
          "random_score": {},
          "boost_mode": "replace",
          "max_results": 10000 <----- IS SOMETHING LIKE THIS POSSIBLE?
        }
      },
      "script": {
        "source": "cosineSimilarity(params.query_vector, doc['my_dense_vector']) + 1.0",
        "params": {
          "query_vector": [0.1, 0.2, ...]
        }
      }
    }
  }

To be clear, I am not asking how to use the search "size" parameter; rather, I want to limit the inner query that passes its results to the script_score.

Thanks!

mayya · April 20, 2020, 10:18pm

If you are willing to drop your requirement for random sampling, you can do the following things:

An Elasticsearch search request has a terminate_after parameter: the maximum number of documents to collect for each shard, upon reaching which query execution terminates early. But this doesn't give you a random sample, because collection always starts with the documents with lower internal IDs and progresses to documents with higher internal IDs; if your index doesn't change, you will always get the same documents.
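As a rough sketch of the terminate_after approach, the request body below (sent to the index's _search endpoint) stops collecting after 10000 documents on each shard, so the cosine similarity script only runs on those. The field names timestamp and my_dense_vector and the three-element query_vector are placeholders borrowed from the other examples in this thread, and the limit of 10000 is an arbitrary illustrative value.

```json
{
  "terminate_after": 10000,
  "query": {
    "script_score": {
      "query": {
        "range": {
          "timestamp": {
            "gte": "now-1d/d",
            "lt": "now/d"
          }
        }
      },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'my_dense_vector') + 1.0",
        "params": {
          "query_vector": [4, 3.4, -0.2]
        }
      }
    }
  }
}
```

Because the limit is applied per shard, the total number of scored documents can be up to 10000 times the number of shards.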

Another way to do this is to put cosine_similarity in rescoring:

{
  "query": {
    "range": {
      "timestamp": {
        "gte": "now-1d/d",
        "lt": "now/d"
      }
    }
  },
  "rescore": {
    "window_size": 1000,
    "query": {
      "rescore_query": {
        "script": {
          "source": "cosineSimilarity(params.query_vector, 'my_dense_vector') + 1.0",
          "params": {
            "query_vector": [4, 3.4, -0.2]
          }
        }
      },
      "query_weight": 0,
      "rescore_query_weight": 1
    }
  }
}

A very fast range filter is executed, and the expensive cosine similarity calculation is applied only to the first 1000 docs. There is no random sampling here either; you will get the same 1000 docs every time.

The only way to get a random sample that I am aware of is indeed to apply the random_score function. To get a random sample you need to apply this function to all documents (or to all documents selected by a filter). The good thing is that this function is quite fast, so there should not be a problem applying it to millions of documents. So what you can do is use your function_score query with the random_score function, and then rescore 1000 docs with the more expensive cosine similarity function.
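A minimal sketch of that combination could look like the request body below. It assumes the same placeholder field names (timestamp, my_dense_vector) and placeholder query_vector as the examples in this thread, and it wraps the script in a script_score query with its own inner query, which is the form that the correction later in this thread arrives at.

```json
{
  "query": {
    "function_score": {
      "query": {
        "range": {
          "timestamp": {
            "gte": "now-1d/d",
            "lt": "now/d"
          }
        }
      },
      "random_score": {},
      "boost_mode": "replace"
    }
  },
  "rescore": {
    "window_size": 1000,
    "query": {
      "rescore_query": {
        "script_score": {
          "query": { "match_all": {} },
          "script": {
            "source": "cosineSimilarity(params.query_vector, 'my_dense_vector') + 1.0",
            "params": {
              "query_vector": [4, 3.4, -0.2]
            }
          }
        }
      },
      "query_weight": 0,
      "rescore_query_weight": 1
    }
  }
}
```

Here the first phase orders the date-filtered documents by a random score, and only the top window_size documents on each shard go through the cosine similarity script. Setting query_weight to 0 discards the random score, so the final ranking depends on the similarity alone.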

Abraham_Sanders · April 22, 2020, 3:30am

Hi @mayya, thank you for the suggestion! However, when trying it, I get this error:

"type" : "parsing_exception",
"reason" : "[script] query does not support [source]"

I am on Elasticsearch 7.6.2, and here is my query:

GET /my_index/_search?size=50
{
  "_source": [ ... ],
  "query": {
    "match_all": {}
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "script": {
          "source": "cosineSimilarity(params.query_vector, 'my_dense_vector') + 1.0",
          "params": {
            "query_vector": [0.1, 0.2, ...]
          }
        }
      },
      "query_weight": 0,
      "rescore_query_weight": 1
    }
  }
}

Thanks again for your assistance!

mayya · April 22, 2020, 10:23am

Sorry, I made a mistake and omitted several lines. The rescore_query part needs a query, so it should look something like this:

"rescore_query" : {
    "script_score": {
        "query": {"match_all": {}},
        "script": {
            "source": "cosineSimilarity(params.query_vector, 'my_dense_vector') + 1.0",
            "params": {
                "query_vector": <query_vector>
            }
        }
    }
}
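
For reference, dropping that fragment into the request from the previous post might give a body like the sketch below. The query_vector values are placeholders (the real vector must have as many dimensions as the dense_vector mapping), and the _source filter from the original request is left out.

```json
{
  "query": {
    "match_all": {}
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "script_score": {
          "query": { "match_all": {} },
          "script": {
            "source": "cosineSimilarity(params.query_vector, 'my_dense_vector') + 1.0",
            "params": {
              "query_vector": [4, 3.4, -0.2]
            }
          }
        }
      },
      "query_weight": 0,
      "rescore_query_weight": 1
    }
  }
}
```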

system · May 20, 2020, 10:24am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
