Hi!
We're trying to write a custom query plugin for Elasticsearch that uses binary data to calculate scores (disclaimer: I've never written one before). Each document can have multiple vectors, so we cannot use the new dense vector type.
Binary mapping
We've made a binary mapping for the vector data:
"viewVectors": {
  "type": "binary",
  "doc_values": true,
  "store": true
}
Indexing binary data as base64 encoded string
And then we index the binary data in base64 encoded format:
val binaryData: Array[Byte] = calculateVectors(???)
document.viewVectors = java.util.Base64.getEncoder().encodeToString(binaryData)
// And index this via REST APIs
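For context, here is a minimal sketch of how the vectors could be packed into bytes before base64 encoding. The names packVectors and unpackVectors and the byte layout (fixed-size vectors, big-endian longs written back to back) are assumptions for illustration only, not our actual calculateVectors or decodeFromBinary code:

```scala
import java.nio.ByteBuffer
import java.util.Base64

object VectorCodec {
  // Hypothetical layout: every vector has the same length, longs are written
  // big-endian (ByteBuffer's default order), one vector after another.
  def packVectors(vectors: List[Array[Long]]): Array[Byte] = {
    val buffer = ByteBuffer.allocate(vectors.map(_.length).sum * java.lang.Long.BYTES)
    vectors.foreach(v => v.foreach(l => buffer.putLong(l)))
    buffer.array()
  }

  // Inverse of packVectors; dims is the number of longs per vector.
  def unpackVectors(bytes: Array[Byte], dims: Int): List[Array[Long]] = {
    require(dims > 0, "dims must be positive")
    val buffer = ByteBuffer.wrap(bytes)
    val count = bytes.length / (dims * java.lang.Long.BYTES)
    List.fill(count)(Array.fill(dims)(buffer.getLong()))
  }

  // Base64 helpers matching the encoder used when indexing.
  def toBase64(bytes: Array[Byte]): String = Base64.getEncoder.encodeToString(bytes)
  def fromBase64(s: String): Array[Byte] = Base64.getDecoder.decode(s)
}
```

With a layout like this, decoding the base64 string and unpacking it gives back the original vectors, so the round trip on the client side is lossless before anything reaches Elasticsearch.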
Trying to read binary data from DocValues
However, when we read the binary data back with context.reader.getBinaryDocValues(fieldName).binaryValue(), it is completely different from what we indexed (we use Scala, but I hope it's clear enough for Java developers as well):
import org.apache.lucene.index.{DocValues, LeafReaderContext}
import org.apache.lucene.search.{ConstantScoreScorer, ConstantScoreWeight, IndexSearcher, Query, ScoreMode, Scorer, TwoPhaseIterator, Weight}

case class VectorDistanceQuery(fieldName: String, searchVectors: List[Array[Long]]) extends Query {
  override def toString(field: String): String = s"VectorDistanceQuery(fieldName=$fieldName)"

  override def createWeight(searcher: IndexSearcher, scoreMode: ScoreMode, boost: Float): Weight = {
    new ConstantScoreWeight(this, boost) {
      override def scorer(context: LeafReaderContext): Scorer = {
        val values = context.reader.getBinaryDocValues(fieldName)
        if (values == null) {
          null // no doc values for this field in this segment
        } else {
          val iterator = new TwoPhaseIterator(values) {
            override def matches: Boolean = {
              // PROBLEM: the bytes we read here are not what we indexed
              val ref = values.binaryValue()
              // Copy only the valid slice; a BytesRef may point into a larger shared buffer.
              // Shouldn't this be the same as what we put in document.viewVectors (without base64 encoding)?
              val binaryData = java.util.Arrays.copyOfRange(ref.bytes, ref.offset, ref.offset + ref.length)
              val documentSearchVectors = decodeFromBinary(binaryData)
              ???
            }

            override def matchCost: Float = 42f // TODO: what should it be?
          }
          new ConstantScoreScorer(this, boost, scoreMode, iterator)
        }
      }

      override def isCacheable(ctx: LeafReaderContext): Boolean = DocValues.isCacheable(ctx, fieldName)
    }
  }
}
What are we missing? Why is the binary data read from context.reader.getBinaryDocValues(fieldName) different from the data we indexed as java.util.Base64.getEncoder().encodeToString(binaryData)?
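To make the mismatch concrete, a small helper along these lines could print both sides as hex so they can be compared byte by byte. This is only a debugging sketch; HexDump and hexSlice are hypothetical names, and the helper takes a plain array plus offset and length so it does not depend on Lucene:

```scala
object HexDump {
  // Render bytes[offset, offset + length) as lowercase hex, two digits per byte,
  // separated by spaces.
  def hexSlice(bytes: Array[Byte], offset: Int, length: Int): String =
    bytes.slice(offset, offset + length).map(b => f"${b & 0xff}%02x").mkString(" ")
}
```

On the indexing side it can be called with the whole array (offset 0, length binaryData.length). Inside matches it can be called with the bytes, offset and length fields of the BytesRef returned by values.binaryValue().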
Help appreciated!