Knn_vectors field understanding

yli · February 25, 2025, 8:53am

Based on my understanding of your comments, the raw values of the dense vector field are stored in Lucene and again in _source unless excluded, so the data is duplicated, and this duplication accounts for the additional 40 GB of storage on disk. Is that right? How does configuring 1 replica further affect storage?

What about the quantized vectors? Are they duplicated in the replica shards as well?

I have attached part of the response from the disk usage API:

{
    "_shards": {
        "total": 2,
        "successful": 2,
        "failed": 0
    },
    "file_flat_1024": {
        "store_size": "216.5gb",
        "store_size_in_bytes": 232469055909,
        "all_fields": {
            "total": "216.4gb",
            "total_in_bytes": 232444165546,
            "inverted_index": {
                "total": "2.9gb",
                "total_in_bytes": 3147344422
            },
            "stored_fields": "163.7gb",
            "stored_fields_in_bytes": 175830555711,
            "doc_values": "97.1mb",
            "doc_values_in_bytes": 101829561,
            "points": "92.2mb",
            "points_in_bytes": 96760144,
            "norms": "9.8mb",
            "norms_in_bytes": 10375998,
            "term_vectors": "0b",
            "term_vectors_in_bytes": 0,
            "knn_vectors": "49.5gb",
            "knn_vectors_in_bytes": 53257299710
        },
        "fields": {
            "_source": {
                "total": "163.5gb",
                "total_in_bytes": 175628654805,
                "inverted_index": {
                    "total": "0b",
                    "total_in_bytes": 0
                },
                "stored_fields": "163.5gb",
                "stored_fields_in_bytes": 175628654805,
                "doc_values": "0b",
                "doc_values_in_bytes": 0,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0,
                "knn_vectors": "0b",
                "knn_vectors_in_bytes": 0
            },
            "file_section_embedding": {
                "total": "49.5gb",
                "total_in_bytes": 53257299710,
                "inverted_index": {
                    "total": "0b",
                    "total_in_bytes": 0
                },
                "stored_fields": "0b",
                "stored_fields_in_bytes": 0,
                "doc_values": "0b",
                "doc_values_in_bytes": 0,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0,
                "knn_vectors": "49.5gb",
                "knn_vectors_in_bytes": 53257299710
            }
        }
    }
}
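To make the proportions easier to read, here is a minimal sketch that computes each field's share of the store size. It uses a trimmed copy of the response above that keeps only the byte counts; the helper name `field_shares` is my own and not part of any Elasticsearch client.

```python
import json

# Trimmed copy of the disk usage response above; only the byte counts
# needed for the calculation are kept.
response = json.loads("""
{
  "file_flat_1024": {
    "store_size_in_bytes": 232469055909,
    "fields": {
      "_source": {"total_in_bytes": 175628654805},
      "file_section_embedding": {"total_in_bytes": 53257299710}
    }
  }
}
""")


def field_shares(index_stats):
    """Return each field's fraction of the index's store size.

    Hypothetical helper for illustration only.
    """
    store = index_stats["store_size_in_bytes"]
    return {
        name: stats["total_in_bytes"] / store
        for name, stats in index_stats["fields"].items()
    }


shares = field_shares(response["file_flat_1024"])
for name, share in shares.items():
    print(f"{name}: {share:.1%} of store size")
```

By this calculation _source accounts for roughly three quarters of the store size and the knn_vectors data of file_section_embedding for a bit under a quarter.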

By the way, judging from the response, the disk usage API seems to report disk usage for the primary shards only. Is that right?
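If that is the case, my rough estimate of the total footprint would look like the sketch below. It rests on two assumptions I would like confirmed: that the store_size above covers primaries only, and that each replica is a full copy of its primary, vector data included. The function name `total_store_bytes` is made up for illustration.

```python
def total_store_bytes(primary_bytes: int, number_of_replicas: int) -> int:
    """Estimate total on-disk size across primaries and replicas.

    Assumption: every replica shard holds a full copy of its primary,
    so the footprint scales with (1 + number_of_replicas). Real sizes
    can differ somewhat between copies.
    """
    return primary_bytes * (1 + number_of_replicas)


# store_size_in_bytes from the response above (assumed: primaries only)
primary_bytes = 232469055909
total = total_store_bytes(primary_bytes, number_of_replicas=1)
print(f"estimated total: {total / 1024**3:.1f} GiB")
```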
