I found the dot product score calculation for the vector of byte element_type is wrong: 0.5 + (dot_product(query, vector) / (32768 * dims))
which is same as : 0.5 + (dot_product(query, vector) / (128*128*2* dims))
refer to: dot_product for byte vector The fault is : we should not divide the dims.

When the element_type type if float, we use this formula, there is no dims param: (1 + dot_product(query, vector)) / 2 ,
which is same as: 0.5 + dot_product(query, vector) / 2

When we change float to byte, we scale the float value by 128 , and the dot_product value will scale to 128*128. To restore the same dot product value as float, we only need to divide 128*128=32768 , no need to divide dims.

here is a python test sample:

import numpy as np
a = np.array([0.2, 0.7, 0.5, 0.1])
a = a/np.linalg.norm(a) # norm it
b = np.array([0.4, 0.1, 0.6, 0.2])
b = b/np.linalg.norm(b) # norm it
dot_float = np.dot(a, b) # 0.7004
score_float = (1 + dot)/2 # 0.8502
#float to byte [-scale, scale]
scale = 128 # to be strict in range [-128, 127], the sacle should be 127
a2 = (a*scale).astype(np.int)
b2 = (b*scale).astype(np.int)
dot_byte = np.dot(a2, b2) # 11210
dot_float2 = dot_int/128/128 # 0.68420, which is similar with 0.7004
dims = 4
score_wrong = 0.5 + (dot_byte / (32768 * dims)) # 0.5855, which is far from 0.8502
score_right = 0.5 + (dot_byte / (32768)) # 0.8421, which is similar with 0.8502

The scoring for dot product using byte vectors is a design decision to keep the score in the range [0, 1] as noted in the JavaDocs for Lucene's VectorUtil#dotProductScore(byte[], byte[]). This requires the additional division by the number of dimensions. I would not consider this behavior a bug. There are different formulas that can be used for float to byte quantization depending on the context, so there is no way to make the scoring in this case comparable.

Just a quick follow up and some thoughts. I'm curious about the situation in which this occurs. Are you interested in the exact dot product of two vectors? For the most part, Lucene is designed with search and ranking in mind, so the design decision @Jack_Conradson mentions is relative to this. Concretely, (my assumption is that) the rank order of document vectors being compared with a query vector should be the same, as well as the relative distance between them. Do you have a situation where this assumption doesn't hold?

The document scores are important. In some cases, the scores will be shown to the user. For example, when searching image by image, if I use the float vectors, the most similar image score is about 1.0. When I use byte vectors, all the scores are near 0.5 (when the vector dimension is very big, e.g. 768 ), and they have very small differences. This may confuse the user.

And the more important is: in hybrid retrieval, this byte/float score difference does affect the order of documents. When search with both query and knn, the document score will be knn_score*knn_boost + query_score*search_boost. The documents order may be different in byte vector and float vector.

doc1_score_float > doc2_score_float, but doc1_score_byte < doc2_score_byte. The doc order will be different. Because the byte vector scores are very near, they nearly have no effect on hybrid scores. @joshdevins@Jack_Conradson

Hey @wangyanbin, thanks so much for the extra context and clarifying the use case. I can definitely see where this is causing you some problems!

I see the GitHub issue for Lucene that you opened recently as well. I believe we are all saying the same thing about the design decision being based on rank order as the underlying principle for the calculation decision. I would recommend that you follow through with the Lucene community to open an Enhancement as mentioned in the above issue. This would allow us in Elasticsearch to make use of what Lucene provides.

In the meantime, you are correct about the hybrid search problem. Two thoughts and recommendations come to mind:

Since the scales differ between float and byte vectors, if you haven't already tried, I would strongly recommend recalibrating the weights of your linear scoring function to optimise a search relevance metric (nDCG, MAP, etc.). We've had success with both simple grid searches over a small parameter space or more complex approaches such as Bayesian optimisation. This might alleviate some of the problems of combining BM25 and the byte-wise dot product scores.

As a replacement for a linear scoring function, we have found that we generally get better results using reciprocal rank fusion (RRF) for hybrid search. While this is not available in the current 8.7 series, it will be released very soon. See the related PR and master documentation for more details.

I hope this is helpful and that you are able to improve your hybrid ranking with byte vectors. Please let us know how you get on and if you find a good way forward.

Ah, I see now. I would defer to @Jack_Conradson but my suggestion would be to open an Enhancement against Elasticsearch as well and maybe we can find a solution that is more flexible for byte vectors.

Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.