Hello,
TL;DR: What is the "indexing" difference between the different similarity options for a dense_vector field?
I have an index with a dense_vector field defined with similarity: cosine.
Now I want to experiment with similarity: dot_product.
For that, I added two additional fields to my mappings, which will be populated via the ingest pipeline.
- Original field feature1_vector with similarity: cosine -> embeddings come from the processor with tag: inference 1, which calls a custom-trained ML model hosted in ES
- New field feature2_vector_inference with similarity: dot_product -> embeddings come from the processor with tag: inference 2, which is precisely the same as the previous one
- New field feature3_vector_copied with similarity: dot_product -> embeddings come from the set processor, which copies them from the first field, feature1_vector
Mapping:
"feature1_vector": { -- ORIGINAL FIELD
  "type": "dense_vector",
  "dims": 384,
  "index": true,
  "similarity": "cosine",
  "index_options": {
    "type": "int8_hnsw",
    "m": 16,
    "ef_construction": 100
  }
},
"feature2_vector_inference": { -- NEW FIELD TO BE POPULATED BY INFERENCE PROCESSOR
  "type": "dense_vector",
  "dims": 384,
  "index": true,
  "similarity": "dot_product",
  "index_options": {
    "type": "int8_hnsw",
    "m": 16,
    "ef_construction": 100
  }
},
"feature3_vector_copied": { -- NEW FIELD TO BE POPULATED BY SET PROCESSOR
  "type": "dense_vector",
  "dims": 384,
  "index": true,
  "similarity": "dot_product",
  "index_options": {
    "type": "int8_hnsw",
    "m": 16,
    "ef_construction": 100
  }
}
Ingest pipeline (shortened version):
[
  {
    "inference": {
      "tag": "inference 1",
      "model_id": "candidate_a",
      "input_output": [
        {
          "input_field": "feature1",
          "output_field": "feature1_vector"
        }
      ],
      "ignore_failure": false,
      "on_failure": [...]
    }
  },
  {
    "inference": {
      "tag": "inference 2",
      "model_id": "candidate_a",
      "input_output": [
        {
          "input_field": "feature1",
          "output_field": "feature2_vector_inference"
        }
      ],
      "ignore_failure": false,
      "on_failure": [...]
    }
  },
  {
    "set": {
      "field": "feature3_vector_copied",
      "copy_from": "feature1_vector"
    }
  }
]
After reindexing some documents, I see that all three fields contain precisely the same embeddings (I am referring to the _source returned by my knn query).
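For context, the comparison I did is roughly the following minimal sketch. The sample values are hypothetical 3-dimensional stand-ins for the real 384-dimensional embeddings, and the helper names are made up for illustration. It also prints the vector norm and both similarity values, because for unit-length vectors the cosine and the dot product give the same number, and as far as I understand dot_product expects normalized vectors.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(dot(a, a))


def cosine(a, b):
    """Cosine similarity: dot product divided by both lengths."""
    return dot(a, b) / (norm(a) * norm(b))


def same_vectors(source, fields, tol=1e-6):
    """True if every listed field in a hit's _source holds the same vector."""
    first = source[fields[0]]
    return all(
        len(source[f]) == len(first)
        and all(abs(x - y) <= tol for x, y in zip(source[f], first))
        for f in fields[1:]
    )


# Hypothetical _source of one hit (made-up values, not real model output).
sample_source = {
    "feature1_vector": [0.6, 0.8, 0.0],
    "feature2_vector_inference": [0.6, 0.8, 0.0],
    "feature3_vector_copied": [0.6, 0.8, 0.0],
}
FIELDS = ["feature1_vector", "feature2_vector_inference", "feature3_vector_copied"]

# Hypothetical unit-length query vector.
query = [0.0, 1.0, 0.0]
vec = sample_source["feature1_vector"]

print("all fields identical:", same_vectors(sample_source, FIELDS))
print("norm of stored vector:", norm(vec))
print("cosine:", cosine(query, vec), "dot product:", dot(query, vec))
```

If the norm printed for the real embeddings is not close to 1, the two similarity values would differ, which is part of why I am asking the questions below.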
My questions:
- Is that expected?
- If so, why can I select the similarity only at index time and not at query time?
- Shouldn't copying embeddings from feature1_vector to feature3_vector_copied, which uses a different similarity, simply fail?
Maybe someone can shed some light on this.
Thanks!