Knn_vectors field understanding

@RainTown, thanks a lot for sharing your thoughts and tests, I really appreciate it.

Hi @RainTown and @john-wagster, suppose we use synthetic source for the dense vector field. Do you know whether there is any difference between the raw vectors obtained via, for example, the following query

POST https://127.0.0.1:9200/file_flat_1024_exclude_vec_3/_search
{
  "size": 5,
  "_source": false,
  "script_fields": {
    "raw_vector": {
      "script": {
        "source": "doc['file_section_embedding'].vectorValue"
      }
    }
  },
  "query": {
    "match_all": {}
  }
}

and the vector values that we would obtain from synthetic source? Thanks a lot.

I would expect vectorValue and the synthetic source value to be the same (they should be, though I haven't tried it myself). Here's my reasoning: with synthetic source, we essentially drill down into Lucene and retrieve the float[] that is used for comparisons when satisfying the query itself. (Here's the original PR I looked at to verify that behavior: Synthetic _source: support dense_vector by nik9000 · Pull Request #89840 · elastic/elasticsearch · GitHub)

My apologies, I wasn't tracking some of the follow-up, but I'm happy to talk through synthetic source more if it would help. I definitely think disabling _source will greatly reduce the disk footprint, and synthetic source is backed by the Lucene copy that you have to store anyway. So what you get back is not exactly the original request vector (which is a string), but it will be close. The reason it's close and not exact is floating-point precision, since the vector is stored as a float[] on disk in Lucene: an equals comparison against the original could fail, but I would expect all vector operations to be identical.
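To illustrate the precision point, here is a minimal Python sketch. It doesn't touch Elasticsearch at all; it only assumes the stored values behave like ordinary 32-bit floats. The `to_float32` helper and the sample values are made up for the example.

```python
import struct

def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float (widened back to 64-bit)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Hypothetical vector as it might be parsed from the request body (64-bit doubles).
original = [0.1, 0.25, -0.7, 0.333333333333]

# What a float[] would hold after the values are narrowed to 32 bits.
stored = [to_float32(v) for v in original]

# Element-wise equality against the original can fail...
exact_match = [o == s for o, s in zip(original, stored)]

# ...but the difference is tiny.
max_abs_diff = max(abs(o - s) for o, s in zip(original, stored))

# Narrowing an already-narrowed value changes nothing, so two reads of the
# same float[] agree with each other exactly.
idempotent = [to_float32(s) for s in stored] == stored

print(exact_match, max_abs_diff, idempotent)
```

Only 0.25 survives the round trip unchanged here, because it is exactly representable in binary. The other values differ from the originals by a very small amount, which is why an equals check against the request values can fail even though anything read back from the same float[] is self-consistent.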

Hi @john-wagster, thanks a lot for sharing. I will reach out if we go with the synthetic source approach.