How can I rerank query results using Euclidean distance on a vector field in Elasticsearch?

gia_huy · July 19, 2019, 5:53am

I have a question about Elasticsearch. I have a set of embedding vectors (dense vectors) and their corresponding string tokens. The tokens come from an algorithm that uses K-Means to map the vectors from a high-dimensional vector space into a smaller subspace, encoded as text, so that the Elasticsearch full-text search engine can query them quickly (similarity search).

I then take the results of the Elasticsearch query phase and rescore (rerank) them by Euclidean distance.

But this rescoring phase does not seem to work: results after rescoring lose similarities from query phase.

Here is my request body (JSON) for the query and rescore with Elasticsearch:

	request_body_1 = {
		"size": s,
		"query": {
			"function_score": {
				"functions": string_tokens_body,
				"score_mode": "sum",
				"boost_mode": "replace"
			}
		},
		"rescore": {
			"window_size": r, # Take the top-r results from the query phase and rescore them by Euclidean distance.
			"query": {
				"rescore_query": {
					"function_score": {
						"script_score": {
							"script": {
								"lang": "painless",
								"source": """
									def sum = 0.0 ;
									for (def index = 0; index < params['_source']['embedding_vector'].length; index++) {
										sum += Math.pow(params.query_vector[index] - doc['embedding_vector'][index], 2);
									}
									return(Math.sqrt(sum));
								""",
								"params": {
									"query_vector": query_vector.tolist() # Passing the numpy array directly does not work here.
								}
							}
						},
						"boost_mode": "replace"
					}
				},
				"query_weight": 0, # Remove scores from query phase.
				"rescore_query_weight": 1 # Just calculate scores according to *rescoring phase*.
			}
		}
	}

Here is an example of a document I index into Elasticsearch:

{
    "index": "my_project",
    "type": "_doc",
    "id": 1,
    "source": {
        "embedding_vector": [1.12, 2.24, 3,34, 4,45],
        "other_field": "other_datatypes"
    }
}

How can I solve this problem?

Thanks in advance for any replies.

mayya · July 23, 2019, 2:26pm

Starting with 7.3, a cosineSimilarity function is available for the special field type dense_vector. In 7.4, l1norm and l2norm (Euclidean distance) will be available as well.
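To make the 7.4 option concrete, here is a minimal sketch of a rescore body that uses l2norm inside a script_score query. It assumes that embedding_vector is mapped as dense_vector and that the cluster is 7.4 or later. The helper name build_l2_rescore_body and its parameters are made up for illustration, and the exact argument form of l2norm has changed between 7.x releases, so check the documentation for your version. Because hits are sorted by descending score, the distance is turned into a score with 1 / (1 + distance).

```python
import json


def build_l2_rescore_body(query_vector, first_pass_query, size=10,
                          window_size=100, field="embedding_vector"):
    """Build a search body that rescores the top hits by Euclidean distance.

    Hypothetical helper: assumes `field` is mapped as dense_vector (7.4+).
    """
    script = {
        # Smaller distance gives a larger score, so the closest vectors rank first.
        # Early 7.x releases take doc['field'] as the second argument; later
        # releases take the field name as a plain string instead.
        "source": f"1 / (1 + l2norm(params.query_vector, doc['{field}']))",
        "params": {"query_vector": [float(x) for x in query_vector]},
    }
    return {
        "size": size,
        "query": first_pass_query,
        "rescore": {
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "script_score": {
                        "query": {"match_all": {}},
                        "script": script,
                    }
                },
                "query_weight": 0,          # drop the first-pass score
                "rescore_query_weight": 1,  # keep only the distance-based score
            },
        },
    }


body = build_l2_rescore_body([1.12, 2.24, 3.34, 4.45], {"match_all": {}})
print(json.dumps(body, indent=2))
```

The first_pass_query argument is where the original function_score query on the string tokens would go.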

If you want to use Euclidean distance in Elasticsearch before that, then you do indeed need to write a script like the one you have. One thing to note is that it is incorrect to access doc['embedding_vector'][index] in the script if your field embedding_vector is a simple numeric field. Even if you index its values as an array in your JSON, inside the index they are stored as multiple values in sorted order. Thus, for example, doc['embedding_vector'][3] will return 4 instead of the 34 you expect. For the correct behaviour, you can instead read the values from the source, as you already do for the loop bound: params['_source']['embedding_vector'][index], but this will be slower.
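The effect of the sorted storage can be imitated in plain Python. This is only a simulation, not Elasticsearch code: sorted() stands in for the order in which doc['embedding_vector'] would expose the values of a numeric field, and the example array is read the way the sample document parses, with six values.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Values in the order they appear in _source.
source_vector = [1.12, 2.24, 3, 34, 4, 45]

# Stand-in for doc values of a numeric field: same values, sorted ascending.
doc_values = sorted(source_vector)

print(source_vector[3])  # 34, the value at position 3 in the original array
print(doc_values[3])     # 4, what the sorted order yields at position 3

# A query identical to the stored vector should be at distance 0,
# but against the sorted values it is not.
query_vector = list(source_vector)
print(euclidean(query_vector, source_vector))  # 0.0
print(euclidean(query_vector, doc_values))     # about 42.43
```

Any distance computed against the sorted values is therefore unrelated to the real vector whenever the components are not already in ascending order.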

About your specific question on rescoring, can you elaborate on what you mean by "results after rescoring lose similarities from query phase"? Does it look like the rescoring phase is not applied at all?
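One possible reading of that symptom, which nothing in this thread confirms, is a ranking direction problem: Elasticsearch returns hits by descending score, so a script that returns the raw distance puts the farthest vectors first. The plain Python sketch below illustrates the idea with made-up documents. The function rank_by_score and the sample vectors are hypothetical, and 1 / (1 + distance) is just one common way to turn a distance into a score.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_score(docs, query, score_fn):
    """Return doc ids ordered by descending score, as a search engine would."""
    scored = [(score_fn(euclidean(query, vec)), doc_id)
              for doc_id, vec in docs.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# Made-up documents and query, for illustration only.
docs = {"near": [1.0, 1.0], "mid": [3.0, 3.0], "far": [9.0, 9.0]}
query = [1.0, 1.5]

# Score = raw distance: the farthest document wins.
raw_order = rank_by_score(docs, query, lambda d: d)

# Score = 1 / (1 + distance): the closest document wins.
inverted_order = rank_by_score(docs, query, lambda d: 1.0 / (1.0 + d))

print(raw_order)       # ['far', 'mid', 'near']
print(inverted_order)  # ['near', 'mid', 'far']
```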

