I had a thought last night: wait, what if Lucene is deduplicating the vectors and I just didn't realize it could do that? As in, if you loaded two identical vectors into two separate fields, would it detect that and avoid storing the raw vectors twice?
I tested that too, and as you might expect, we don't dedup those. This is fun though.
mapping:
curl -XPUT --header 'Content-Type: application/json' "http://localhost:9200/test" -d '{
"mappings": {
"properties": {
"image-vector": {
"type": "dense_vector",
"dims": 64,
"similarity": "l2_norm",
"index": true,
"index_options": {
"type": "bbq_hnsw"
}
},
"image-vector2": {
"type": "dense_vector",
"dims": 64,
"similarity": "l2_norm",
"index": true,
"index_options": {
"type": "int8_hnsw"
}
}
}
}
}'
adding 10,000 docs, each with the same random vector in both fields:
VECTOR=$(python -c 'import numpy as np; print(np.random.random(64).tolist())');
seq 1 10000 | xargs -I % -P1 curl -XPOST --header 'Content-Type: application/json' "http://localhost:9200/test/_doc" -d "
{ \"image-vector\": $VECTOR,
\"image-vector2\": $VECTOR }
"
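The numbers below come from the disk usage API; something like this should reproduce them (a refresh and force-merge first keeps the segment count from skewing the picture, and `run_expensive_tasks=true` is required for this API):

```shell
# make all docs visible on disk, collapse to one segment, then analyze per-field usage
curl -XPOST "http://localhost:9200/test/_refresh"
curl -XPOST "http://localhost:9200/test/_forcemerge?max_num_segments=1"
curl -XPOST "http://localhost:9200/test/_disk_usage?run_expensive_tasks=true&pretty"
```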
relevant output of the _disk_usage API:
"image-vector": {
"total": "2.6mb",
"total_in_bytes": 2801631,
"inverted_index": {
"total": "0b",
"total_in_bytes": 0
},
"stored_fields": "0b",
"stored_fields_in_bytes": 0,
"doc_values": "0b",
"doc_values_in_bytes": 0,
"points": "0b",
"points_in_bytes": 0,
"norms": "0b",
"norms_in_bytes": 0,
"term_vectors": "0b",
"term_vectors_in_bytes": 0,
"knn_vectors": "2.6mb",
"knn_vectors_in_bytes": 2801631
},
"image-vector2": {
"total": "3.1mb",
"total_in_bytes": 3261630,
"inverted_index": {
"total": "0b",
"total_in_bytes": 0
},
"stored_fields": "0b",
"stored_fields_in_bytes": 0,
"doc_values": "0b",
"doc_values_in_bytes": 0,
"points": "0b",
"points_in_bytes": 0,
"norms": "0b",
"norms_in_bytes": 0,
"term_vectors": "0b",
"term_vectors_in_bytes": 0,
"knn_vectors": "3.1mb",
"knn_vectors_in_bytes": 3261630
}
math (rough per-field estimate: quantized vectors + HNSW graph overhead + the raw float32 copy):
# bbq_hnsw: 1 bit/dim plus ~14 bytes of correction data per vector
10_000 * (64/8+14) + 10_000 * 16 + 10_000 * 64 * 4 = 2940000
# int8_hnsw: 1 byte/dim per vector
10_000 * 64 + 10_000 * 16 + 10_000 * 64 * 4 = 3360000
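A quick sanity check of that arithmetic in Python — the per-vector terms are my reading of the estimate above (quantized bytes, ~16 bytes of HNSW graph overhead per vector, and a 4-byte float32 per dimension for the raw copy), so treat the breakdown as an assumption, not a format spec:

```python
docs, dims = 10_000, 64

# per-field on-disk estimate: quantized vector + graph overhead + raw float32 copy, per doc
def estimate(quantized_bytes, graph_bytes=16, raw_bytes_per_dim=4):
    return docs * (quantized_bytes + graph_bytes + dims * raw_bytes_per_dim)

bbq = estimate(dims // 8 + 14)   # 1 bit/dim plus ~14 bytes of correction data
int8 = estimate(dims)            # 1 byte/dim
print(bbq, int8)                 # 2940000 3360000
```

Both estimates land a few percent above the measured 2801631 and 3261630 bytes, which is about what you'd expect from a back-of-the-envelope graph-overhead guess.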