Weird behavior of dot_product similarity on dense_vector field

Hello,

TL;DR: What is the difference at index time between the different similarity options for a dense_vector field?

I have an index with a dense_vector field defined with similarity: cosine.
Now I want to experiment with similarity: dot_product.

For that, I added two additional fields to my mappings, which will be populated via the ingest pipeline.

  1. Original field feature1_vector with similarity: cosine -> embeddings come from the inference processor tagged "inference 1", which calls a custom-trained ML model hosted in ES
  2. New field feature2_vector_inference with similarity: dot_product -> embeddings come from the inference processor tagged "inference 2", which is otherwise exactly the same as the previous one
  3. New field feature3_vector_copied with similarity: dot_product -> embeddings come from a set processor, which copies them from the first field, feature1_vector

Mapping:

"feature1_vector": { -- ORIGINAL FIELD
  "type": "dense_vector",
  "dims": 384,
  "index": true,
  "similarity": "cosine",
  "index_options": {
    "type": "int8_hnsw",
    "m": 16,
    "ef_construction": 100
  }
},
"feature2_vector_inference": { -- NEW FIELD TO BE POPULATED BY INFERENCE PROCESSOR
  "type": "dense_vector",
  "dims": 384,
  "index": true,
  "similarity": "dot_product",
  "index_options": {
    "type": "int8_hnsw",
    "m": 16,
    "ef_construction": 100
  }
},
"feature3_vector_copied": { -- NEW FIELD TO BE POPULATED BY SET PROCESSOR
  "type": "dense_vector",
  "dims": 384,
  "index": true,
  "similarity": "dot_product",
  "index_options": {
    "type": "int8_hnsw",
    "m": 16,
    "ef_construction": 100
  }
}

Ingest pipeline (shortened version):

[
  {
    "inference": {
      "tag": "inference 1",
      "model_id": "candidate_a",
      "input_output": [
        {
          "input_field": "feature1",
          "output_field": "feature1_vector"
        }
      ],
      "ignore_failure": false,
      "on_failure": [...]
    }
  },
  {
    "inference": {
      "tag": "inference 2",
      "model_id": "candidate_a",
      "input_output": [
        {
          "input_field": "feature1",
          "output_field": "feature2_vector_inference"
        }
      ],
      "ignore_failure": false,
      "on_failure": [...]
    }
  },
  {
    "set": {
      "field": "feature3_vector_copied",
      "copy_from": "feature1_vector"
    }
  }
]

After reindexing some documents, I see that all three fields contain exactly the same embeddings (I am referring to the _source returned by my kNN query).

My questions:

  1. Is that expected?
  2. If so, why can I select the similarity only at index time and not at query time?
  3. Shouldn't copying embeddings from feature1_vector to feature3_vector_copied, which has a different similarity, simply fail?

Maybe someone can shed some light on this :)
Thanks!

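Starting with 8.12, vectors in a field with cosine similarity are automatically normalized at index time, and the dot product calculation is then used under the hood as a performance enhancement. The _source is not changed by this, so it still shows the vectors exactly as they were ingested. You can read more details in the blog or the PR if you're interested.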
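
To illustrate why that optimization is safe, here is a minimal sketch in plain Python (not Elasticsearch code; the helper names and the two example vectors are made up for illustration). It shows that the cosine similarity of two vectors equals the dot product of their unit-length versions, while the dot product of the raw vectors is a different number.

```python
import math

def dot(a, b):
    # Plain dot product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    # Euclidean (L2) length of a vector.
    return math.sqrt(dot(a, a))

def normalize(a):
    # Scale a vector to unit length.
    n = norm(a)
    return [x / n for x in a]

def cosine(a, b):
    # Cosine similarity: dot product divided by both lengths.
    return dot(a, b) / (norm(a) * norm(b))

# Hypothetical 3-dimensional "embeddings", chosen for easy arithmetic.
doc = [3.0, 4.0, 0.0]    # length 5
query = [1.0, 2.0, 2.0]  # length 3

cos_score = cosine(doc, query)                        # 11 / 15
dot_on_unit = dot(normalize(doc), normalize(query))   # same value as cos_score
raw_dot = dot(doc, query)                             # 11.0, not the same

print(cos_score, dot_on_unit, raw_dot)
```

So once the vectors are unit length, only the cheaper dot product is needed at search time. This is also why dot_product expects float vectors that are already normalized to unit length.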


Thank you for your reply.

May I ask two follow-up questions?

  1. Why keep both the cosine and dot_product options for dense_vector.similarity if cosine internally uses dot_product to compute similarity (which is more efficient)? Are there other cases where choosing cosine or dot_product actually makes a difference?
  2. Would indexing be faster if I explicitly set similarity: dot_product in the mapping?

Thank you :)

To maintain backward compatibility, we would not remove a GA feature without warning. As for indexing speed, I think you'd have to benchmark with your own model to see whether there is a real difference.