How to read indexed binary field data from doc values (in custom Query plugin)


We're trying to write a custom query plugin for Elasticsearch that uses binary data to calculate scores (disclaimer: I've never written one before). Each document can have multiple vectors, so we cannot use the new dense_vector type.

Binary mapping

We've made a binary mapping for the vector data:

"viewVectors": {
  "type": "binary",
  "doc_values": true,
  "store": true
}

Indexing binary data as base64 encoded string

And then we index the binary data in base64 encoded format:

val binaryData: Array[Byte] = calculateVectors(???)
document.viewVectors = java.util.Base64.getEncoder().encodeToString(binaryData)
// And index this via REST APIs
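To make our expectation concrete, here is a minimal sketch of the round trip we assume holds: decoding the base64 string should give back exactly the bytes we started with, and those are the bytes we expect to see in the doc values. The sample bytes and the hex helper are hypothetical stand-ins, not our real vector data.

```scala
import java.util.Base64

object Base64RoundTrip {
  // Hypothetical stand-in for the output of calculateVectors(...)
  val original: Array[Byte] = Array[Byte](0, 1, 2, 127, -128, -1)

  // What we put into the JSON document before indexing
  val encoded: String = Base64.getEncoder.encodeToString(original)

  // What we expect to be able to recover on the read side
  val decoded: Array[Byte] = Base64.getDecoder.decode(encoded)

  // Hex dump helper for comparing indexed bytes with bytes read back
  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02x").mkString(" ")
}
```

Printing Base64RoundTrip.hex(...) for both the indexed bytes and the bytes read in the scorer makes it easier to see whether the data is shifted, prefixed, or entirely different.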

Trying to read binary data from DocValues

However, when we read the binary data back with context.reader.getBinaryDocValues(fieldName).binaryValue(), it is completely different from what we indexed (we use Scala, but I hope it's clear enough for Java developers as well):

import org.apache.lucene.index.{DocValues, LeafReaderContext}
import org.apache.lucene.search._

case class VectorDistanceQuery(fieldName: String, searchVectors: List[Array[Long]]) extends Query {
  override def toString(field: String): String = s"VectorDistanceQuery(fieldName=$fieldName)"
  override def createWeight(searcher: IndexSearcher, scoreMode: ScoreMode, boost: Float): Weight = {
    new ConstantScoreWeight(this, boost) {
      override def scorer(context: LeafReaderContext): Scorer = {
        val values = context.reader.getBinaryDocValues(fieldName)
        if (values == null) {
          null // no doc values for this field in this segment, so nothing can match
        } else {
          val iterator = new TwoPhaseIterator(values) {
            override def matches: Boolean = {
              // PROBLEM: the bytes we read here are not what we indexed
              val binaryData = values.binaryValue().bytes // Shouldn't this be the same as what we put in document.viewVectors (without base64 encoding)?
              val documentSearchVectors = decodeFromBinary(binaryData)
              ??? // matching logic omitted from this post
            }
            override def matchCost: Float = 42f // TODO: what should it be?
          }
          new ConstantScoreScorer(this, boost, scoreMode, iterator)
        }
      }
      override def isCacheable(ctx: LeafReaderContext): Boolean = DocValues.isCacheable(ctx, fieldName)
    }
  }
}
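One thing worth ruling out in the snippet above: binaryValue() returns a Lucene BytesRef, whose bytes field is only a backing array. The valid data is the region described by its offset and length fields, so reading .bytes directly can pick up unrelated bytes. A minimal sketch of copying just the valid region follows; validBytes is a hypothetical helper name, and it takes plain arguments so it can be tried without Lucene on the classpath.

```scala
object BytesRefSlice {
  // Copies the valid region [offset, offset + length) out of a backing array.
  // Intended use (assumption): val ref = values.binaryValue()
  //                            validBytes(ref.bytes, ref.offset, ref.length)
  def validBytes(bytes: Array[Byte], offset: Int, length: Int): Array[Byte] =
    java.util.Arrays.copyOfRange(bytes, offset, offset + length)
}
```

This alone may not explain the difference we see, but it removes one source of noise when comparing the bytes.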

What are we missing? Why is the binary data read from context.reader.getBinaryDocValues(fieldName) different from the indexed base64-encoded data java.util.Base64.getEncoder().encodeToString(binaryData)?
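One hypothesis we have not verified against the Elasticsearch source: the binary field may wrap the raw bytes when writing doc values, prefixing them with a value count and a per-value length, both written as variable-length integers (Lucene's VInt format, 7 bits per byte with the high bit as a continuation flag). If that layout assumption holds, a decoder could look roughly like the sketch below. The object and method names are hypothetical.

```scala
object BinaryDocValuesDecoder {
  // Reads one VInt starting at `start`; returns (value, position after it).
  private def readVInt(buf: Array[Byte], start: Int): (Int, Int) = {
    var pos = start
    var value = 0
    var shift = 0
    var more = true
    while (more) {
      val b = buf(pos) & 0xff
      pos += 1
      value |= (b & 0x7f) << shift // low 7 bits carry data
      shift += 7
      more = (b & 0x80) != 0       // high bit set means another byte follows
    }
    (value, pos)
  }

  // ASSUMED layout: VInt(count), then for each value VInt(length) + raw bytes.
  def decode(buf: Array[Byte], offset: Int, length: Int): List[Array[Byte]] = {
    val (count, afterCount) = readVInt(buf, offset)
    var pos = afterCount
    val out = List.newBuilder[Array[Byte]]
    for (_ <- 0 until count) {
      val (len, afterLen) = readVInt(buf, pos)
      out += java.util.Arrays.copyOfRange(buf, afterLen, afterLen + len)
      pos = afterLen + len
    }
    require(pos <= offset + length, "value runs past the valid region")
    out.result()
  }
}
```

If the first byte we read back is a small number such as 1, followed by a byte equal to the length of our payload, that would be consistent with this layout.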

Help appreciated!
