Dense Vector Field Extremely Large

I had a thought last night: what if Lucene is deduping the vectors and I just didn’t realize it could do that? As in, if you loaded two identical vectors into two separate fields, would we detect that and not store the raw vectors twice?

I tested that too. And as you might expect, we don’t dedup those: each field stores its own copy of the raw vectors. This is fun though.

mapping

curl -XPUT --header 'Content-Type: application/json' "http://localhost:9200/test" -d '{
  "mappings": {
    "properties": {
      "image-vector": {
        "type": "dense_vector",
        "dims": 64,
        "similarity": "l2_norm",
        "index": true,
        "index_options": {
          "type": "bbq_hnsw"
        }
      },
      "image-vector2": {
        "type": "dense_vector",
        "dims": 64,
        "similarity": "l2_norm",
        "index": true,
        "index_options": {
          "type": "int8_hnsw"
        }
      }      
    }
  }
}'

adding docs:

VECTOR=$(python -c 'import numpy as np; print(np.random.random(64).tolist())');
seq 1 10000 | xargs -I % -P1 curl -XPOST --header 'Content-Type: application/json' "http://localhost:9200/test/_doc" -d "
    { \"image-vector\": $VECTOR,
      \"image-vector2\": $VECTOR }
"
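
For anyone who wants to reproduce this faster than one curl per document, here is a minimal sketch that builds a single _bulk request body instead. It keeps the same setup as the loop above (one random 64-dim vector, reused for both fields in all 10,000 docs). The helper names build_bulk_body and post_bulk are made up for this example, and the URL assumes the same local node as the curl commands.

```python
import json
import random
import urllib.request


def build_bulk_body(num_docs, dims, seed=0):
    """Build an NDJSON _bulk body: one action line plus one source line per doc."""
    rng = random.Random(seed)
    # One vector, reused for both fields and every doc, as in the shell loop.
    vector = [rng.random() for _ in range(dims)]
    source = json.dumps({"image-vector": vector, "image-vector2": vector})
    action = json.dumps({"index": {}})
    lines = []
    for _ in range(num_docs):
        lines.append(action)
        lines.append(source)
    # The bulk API requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


def post_bulk(body, url="http://localhost:9200/test/_bulk"):
    """Send the body to a running node (assumed URL; not called here)."""
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


body = build_bulk_body(num_docs=10_000, dims=64)
```

Calling post_bulk(body) against the test index should index the same 10,000 documents in one request. Segment layout may differ from the one-at-a-time run, so the disk_usage numbers will not necessarily match byte for byte.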

relevant output of disk_usage:

            "image-vector": {
                "total": "2.6mb",
                "total_in_bytes": 2801631,
                "inverted_index": {
                    "total": "0b",
                    "total_in_bytes": 0
                },
                "stored_fields": "0b",
                "stored_fields_in_bytes": 0,
                "doc_values": "0b",
                "doc_values_in_bytes": 0,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0,
                "knn_vectors": "2.6mb",
                "knn_vectors_in_bytes": 2801631
            },
            "image-vector2": {
                "total": "3.1mb",
                "total_in_bytes": 3261630,
                "inverted_index": {
                    "total": "0b",
                    "total_in_bytes": 0
                },
                "stored_fields": "0b",
                "stored_fields_in_bytes": 0,
                "doc_values": "0b",
                "doc_values_in_bytes": 0,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0,
                "knn_vectors": "3.1mb",
                "knn_vectors_in_bytes": 3261630
            }

math:

# bbq_hnsw
10_000 * (64/8+14) + 10_000 * 16 + 10_000 * 64 * 4 = 2940000

# int8_hnsw
10_000 * 64 + 10_000 * 16 + 10_000 * 64 * 4 = 3360000
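
The same arithmetic as a small Python sketch, compared against the knn_vectors_in_bytes values from the disk_usage output above. The names given to the three terms (quantized copy, 16 bytes of per-vector overhead, raw float32 copy) are an assumed reading of the formula, not something the output itself confirms.

```python
def estimate_bytes(num_vectors, dims, quantized_bytes_per_vector,
                   overhead_bytes_per_vector=16, float_bytes=4):
    """Rough per-field estimate; term labels are assumptions, see lead-in."""
    quantized = num_vectors * quantized_bytes_per_vector
    overhead = num_vectors * overhead_bytes_per_vector
    raw = num_vectors * dims * float_bytes  # raw float32 copy, present in both fields
    return quantized + overhead + raw


NUM_VECTORS = 10_000
DIMS = 64

estimates = {
    # bbq_hnsw: 1 bit per dimension plus 14 extra bytes per vector
    "image-vector": estimate_bytes(NUM_VECTORS, DIMS, DIMS // 8 + 14),
    # int8_hnsw: 1 byte per dimension
    "image-vector2": estimate_bytes(NUM_VECTORS, DIMS, DIMS),
}

# knn_vectors_in_bytes as reported by disk_usage above
observed = {
    "image-vector": 2801631,
    "image-vector2": 3261630,
}

for field, estimate in estimates.items():
    ratio = observed[field] / estimate
    print(f"{field}: estimated {estimate}, observed {observed[field]}, ratio {ratio:.3f}")
```

Both estimates carry the full 10_000 * 64 * 4 = 2,560,000 bytes of raw vectors, and both observed sizes are well above that figure, which is consistent with each field keeping its own raw copy.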