How can I rerank query results using Euclidean distance on vector fields in Elasticsearch?

I have a question about Elasticsearch. I have a set of embedding vectors (dense vectors) and their corresponding string tokens. The tokens come from an algorithm that uses K-Means to map the vectors from a high-dimensional vector space into a smaller subspace (in text form), so that the full-text search engine in Elasticsearch can query them quickly (similarity search).

I then take the results of the Elasticsearch query phase and rescore (rerank) them by Euclidean distance.

But this rescoring phase does not seem to work: after rescoring, the results lose the similarity ordering they had in the query phase.

Here is my request body (JSON) for the query and rescore with Elasticsearch:

	request_body_1 = {
		"size": s,
		"query": {
			"function_score": {
				"functions": string_tokens_body,
				"score_mode": "sum",
				"boost_mode": "replace"
			}
		},
		"rescore": {
			"window_size": r, # Rescore the top-r results from the query phase using Euclidean distance.
			"query": {
				"rescore_query": {
					"function_score": {
						"script_score": {
							"script": {
								"lang": "painless",
								"source": """
									def sum = 0.0 ;
									for (def index = 0; index < params['_source']['embedding_vector'].length; index++) {
										sum += Math.pow(params.query_vector[index] - doc['embedding_vector'][index], 2);
									}
									return(Math.sqrt(sum));
								""",
								"params": {
									"query_vector": query_vector.tolist() # Convert to a list: a NumPy array is not JSON-serializable.
								}
							}
						},
						"boost_mode": "replace"
					}
				},
				"query_weight": 0, # Remove scores from query phase.
				"rescore_query_weight": 1 # Just calculate scores according to *rescoring phase*.
			}
		}
	}

Here is an example of a document I index into Elasticsearch:

	{
		"index": "my_project",
		"type": "_doc",
		"id": 1,
		"source": {
			"embedding_vector": [1.12, 2.24, 3,34, 4,45],
			"other_field": "other_datatypes"
		}
	}

How can I solve this problem?

Thanks in advance for any replies.

Since 7.3, a cosineSimilarity function is available for the special field type dense_vector. In 7.4, l1norm and l2norm (Euclidean distance) will be available as well.
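
As a rough sketch of what the rescore clause might look like with the built-in function from 7.4 on, assuming embedding_vector is mapped as dense_vector: the helper name build_l2_rescore is hypothetical, and the 1 / (1 + distance) transform is an assumption added here so that closer vectors get higher scores, since Elasticsearch ranks by descending score. Some later releases take the field name as a plain string instead of doc['...'], so check the documentation for your version.

```python
import json


def build_l2_rescore(field, query_vector, window_size):
    """Build a rescore clause that reranks by Euclidean distance (hypothetical helper).

    Assumes `field` is mapped as dense_vector and the cluster is 7.4 or later.
    """
    # A smaller distance must give a larger score, hence 1 / (1 + distance).
    source = f"1 / (1 + l2norm(params.query_vector, doc['{field}']))"
    return {
        "window_size": window_size,
        "query": {
            "rescore_query": {
                "script_score": {
                    "query": {"match_all": {}},
                    "script": {
                        "source": source,
                        "params": {"query_vector": list(query_vector)},
                    },
                }
            },
            "query_weight": 0,          # drop the query-phase score
            "rescore_query_weight": 1,  # keep only the rescore score
        },
    }


rescore = build_l2_rescore("embedding_vector", [1.0, 2.0, 3.0], window_size=50)
rescore_json = json.dumps(rescore)  # the clause is plain JSON-serializable data
```

The returned dict would take the place of the "rescore" value in the request body above.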

If you want to use Euclidean distance in Elasticsearch before that, you do need a script along the lines of the one you wrote. One thing to note: accessing doc['embedding_vector'][index] in a script is incorrect if embedding_vector is a plain numeric field. Even if you index its values as an array in your JSON, the index stores them as multiple values in sorted order. For example, doc['embedding_vector'][3] will return 4 instead of the 34 you expect. For correct behaviour, read the values from the source instead, as you already do for the loop bound: params['_source']['embedding_vector'][index]. This will be slower, though.
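
To illustrate the effect, here is a plain Python simulation (not Elasticsearch code). It assumes the example array is read as six values, and it models doc values simply as the sorted copy of the source array; the variable names are made up for the illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# The array as it appears in _source (original order).
source_vector = [1.12, 2.24, 3, 34, 4, 45]
# Simulated view through doc['embedding_vector']: same values, sorted.
doc_values_view = sorted(source_vector)

# Query with the document's own vector: the true distance is zero.
query_vector = list(source_vector)

dist_from_source = euclidean(query_vector, source_vector)
dist_from_doc_values = euclidean(query_vector, doc_values_view)

print(source_vector[3], doc_values_view[3])
print(dist_from_source, dist_from_doc_values)
```

Because the positions no longer line up, the distance computed from the sorted view is non-zero even for an identical vector, which would scramble any ranking based on it.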

Regarding your specific question about rescoring, can you elaborate on what you meant by "results after rescoring lose similarities from query phase"? Does it look as though the rescoring phase is not applied at all?

