How to read indexed binary field data from doc values (in custom Query plugin)

Pyppe · July 27, 2023, 6:19am

Hi!

We're trying to write a custom query plugin for Elasticsearch that uses binary data to calculate scores (disclaimer: I've never written one before). Each document can have multiple vectors, so we cannot use the new dense vector type.

Binary mapping

We've made a binary mapping for the vector data:

"viewVectors": {
  "type": "binary",
  "doc_values": true,
  "store": true
}

Indexing binary data as base64 encoded string

And then we index the binary data in base64 encoded format:

val binaryData: Array[Byte] = calculateVectors(???)
document.viewVectors = java.util.Base64.getEncoder().encodeToString(binaryData)
// And index this via REST APIs
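For context, here is a minimal sketch of the kind of packing we mean. The names `VectorCodec`, `packVectors` and `unpackVectors` are hypothetical, and the layout (fixed-dimension vectors of big-endian longs, written back to back) is an assumption for illustration, not necessarily what `calculateVectors` produces:

```scala
import java.nio.ByteBuffer

object VectorCodec {
  private val LongBytes = java.lang.Long.BYTES // 8

  // Assumed layout: all longs of all vectors, big-endian, concatenated.
  def packVectors(vectors: List[Array[Long]]): Array[Byte] = {
    val buf = ByteBuffer.allocate(vectors.map(_.length).sum * LongBytes)
    vectors.foreach(v => v.foreach(x => buf.putLong(x)))
    buf.array()
  }

  // Inverse of packVectors, assuming every vector has exactly `dim` longs.
  def unpackVectors(bytes: Array[Byte], dim: Int): List[Array[Long]] = {
    require(dim > 0 && bytes.length % (dim * LongBytes) == 0, "unexpected payload size")
    val buf = ByteBuffer.wrap(bytes)
    val count = bytes.length / (dim * LongBytes)
    List.fill(count)(Array.fill(dim)(buf.getLong()))
  }
}
```

Encoding the packed bytes with `java.util.Base64.getEncoder().encodeToString(...)` and decoding them again with `java.util.Base64.getDecoder().decode(...)` should give back the same byte array, so the base64 step on its own should not change the data.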

Trying to read binary data from DocValues

However, when we read the binary data via context.reader.getBinaryDocValues(fieldName).binaryValue(), the bytes are completely different from what we indexed (we use Scala, but I hope it's clear to Java developers as well):

import org.apache.lucene.index.{DocValues, LeafReaderContext}
import org.apache.lucene.search._

case class VectorDistanceQuery(fieldName: String, searchVectors: List[Array[Long]]) extends Query {
  override def toString(field: String): String = s"VectorDistanceQuery(fieldName=$fieldName)"

  override def createWeight(searcher: IndexSearcher, scoreMode: ScoreMode, boost: Float): Weight = {
    new ConstantScoreWeight(this, boost) {
      override def scorer(context: LeafReaderContext): Scorer = {
        val values = context.reader.getBinaryDocValues(fieldName)
        if (values == null) {
          null // this segment has no doc values for the field
        } else {
          val iterator = new TwoPhaseIterator(values) {
            override def matches: Boolean = {
              // PROBLEM: the bytes we read here are not what we indexed 🤨
              val ref = values.binaryValue()
              // BytesRef.bytes is a backing array that may be larger than the value;
              // only the range [offset, offset + length) belongs to the current document.
              val binaryData = java.util.Arrays.copyOfRange(ref.bytes, ref.offset, ref.offset + ref.length)
              // Shouldn't this be the same as what we put in document.viewVectors (before base64 encoding)?
              val documentSearchVectors = decodeFromBinary(binaryData)
              ???
            }

            override def matchCost: Float = 42f // TODO: what should it be?
          }
          new ConstantScoreScorer(this, boost, scoreMode, iterator)
        }
      }

      override def isCacheable(ctx: LeafReaderContext): Boolean = DocValues.isCacheable(ctx, fieldName)
    }
  }
}

What are we missing? Why do the bytes read from context.reader.getBinaryDocValues(fieldName) differ from the bytes we indexed (base64-encoded with java.util.Base64.getEncoder().encodeToString(binaryData))?
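One possible explanation, stated as an assumption rather than something confirmed in this thread: as far as I can tell from the Elasticsearch source (`BinaryFieldMapper`), a `binary` field with `doc_values: true` does not store the raw bytes as the doc value. It appears to write a variable-length int with the number of values, followed by each value as a variable-length int length plus the bytes. If that holds for our version, the decoding would look roughly like the sketch below. `BinaryDocValuesLayout` and `decodeValues` are hypothetical names, and the layout should be verified against the source of the Elasticsearch version in use:

```scala
object BinaryDocValuesLayout {
  // Reads a variable-length int: 7 payload bits per byte, low-order group first;
  // a set high bit means another byte follows. Returns (value, next position).
  private def readVInt(bytes: Array[Byte], start: Int): (Int, Int) = {
    var value = 0
    var shift = 0
    var i = start
    var more = true
    while (more) {
      val b = bytes(i)
      value |= (b & 0x7F) << shift
      shift += 7
      i += 1
      more = (b & 0x80) != 0
    }
    (value, i)
  }

  // Assumed layout: vInt(count), then count times [vInt(length), bytes].
  // offset and length correspond to BytesRef.offset and BytesRef.length.
  def decodeValues(bytes: Array[Byte], offset: Int, length: Int): List[Array[Byte]] = {
    val end = offset + length
    val (count, afterCount) = readVInt(bytes, offset)
    var pos = afterCount
    val out = List.newBuilder[Array[Byte]]
    for (_ <- 0 until count) {
      val (len, afterLen) = readVInt(bytes, pos)
      require(afterLen + len <= end, "value runs past the end of the BytesRef range")
      out += java.util.Arrays.copyOfRange(bytes, afterLen, afterLen + len)
      pos = afterLen + len
    }
    out.result()
  }
}
```

In the scorer this would be called as `decodeValues(ref.bytes, ref.offset, ref.length)` on the `BytesRef` returned by `binaryValue()`, and each returned array would then go through our own vector decoding. Under the same assumption, multiple values per document may come back sorted and de-duplicated rather than in indexing order.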

— Help appreciated!

