Knn_vectors field understanding

Hi, I have defined a field named "file_section_embedding" in my index mapping as dense_vector, with index enabled for it.

I also tested both including and excluding the dense vector field in/from "_source", and both options give me the same amount of storage in this field, which is about 49.6gb. I have 10376000 documents, and therefore 10376000 vectors, which roughly aligns with the following calculation using the default int8_hnsw quantization.

> float((10376000*1024*4)/(1024**3)) ~= 39.59 GB
> 39.59*0.25 ~= 9.9 GB

But I have the following questions about the above observations:

  1. I saw that after excluding the "file_section_embedding" field from _source, there is an obvious drop in the storage size of the _source field, from 163.5gb to 5.7gb, which I think is due to the raw vectors no longer being kept in _source, right? But how are the dense vectors stored in _source such that they consume so much disk space?
  2. I find it a bit strange that the storage size of the "file_section_embedding" field is the same whether it is included in or excluded from _source, which gives me the impression that the raw vector values are still kept in the field. I tested it with the following query against the index that excludes it from _source:
POST https://127.0.0.1:9200/file_flat_1024_exclude_vec_3/_search
{
  "size": 5,
  "_source": false,
  "script_fields": {
    "raw_vector": {
      "script": {
        "source": "doc['file_section_embedding'].vectorValue"
      }
    }
  },
  "query": {
    "match_all": {}
  }
}

I also read in the official Elasticsearch documentation (Dense vector field type | Elasticsearch Guide [8.17] | Elastic) that the raw vector values are kept:

> Quantization will continue to keep the raw float vector values on disk for reranking, reindexing, and quantization improvements over the lifetime of the data. This means disk usage will increase by ~25% for int8, ~12.5% for int4, and ~3.1% for bbq due to the overhead of storing the quantized and raw vectors.

2.1 Does the reranking happen automatically when we index new documents?
2.2 What do the quantization improvements include?
2.3 Does it mean that we could still reindex even after excluding the dense vector field from _source?
2.4 Does it mean that we could still use the raw vector values for rescoring?
2.5 Could we export all the raw vectors from an index that contains a large number of vectors, like 40GB or even more in our case?

Thanks a lot.

Let me walk you through some of this and hopefully that will help a bit.

Also, here's a related discussion you can take a look at: Dense vectors taking up much more space than expected

The way this gets stored is that we use Lucene under the hood to store both the quantized (int8 in this case) representation, which in your case is the ~9.9GB of data, AND the raw vectors at, as you accurately computed, about 39.59GB. This is for comparison, reranking, and retrieval. At a minimum I would expect to see this disk usage with your model and number of documents, independent of your usage of _source.

_source adds to this total because it's a separate store of the raw vectors, primarily there to make reindexing easier. I would highly recommend you don't start with this enabled. If you read through the link above you'll see I mention synthetic source, which is a great option as it completely removes the storage component of _source, but I do believe it is, or is transitioning toward, a paid feature now. So with _source enabled, just from that one field I would expect ~40GB + ~10GB + ~40GB = ~90GB of storage.
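
For reference, excluding the field from _source is done in the mapping. A minimal sketch, reusing the index name and dimensions from this thread (adjust to your own fields):

PUT /file_flat_1024_exclude_vec_3
{
  "mappings": {
    "_source": {
      "excludes": ["file_section_embedding"]
    },
    "properties": {
      "file_section_embedding": {
        "type": "dense_vector",
        "dims": 1024,
        "index": true,
        "index_options": { "type": "int8_hnsw" }
      }
    }
  }
}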

The numbers you mentioned seem a little odd to me, but here's my guess. The 163.5GB sounds like roughly double the storage I would expect with _source enabled, which likely means you are using the default of 1 primary shard and 1 replica, so you can expect twice the disk usage. The numbers don't quite line up, but if you dig in you might find that's roughly what you are seeing. 5.7GB doesn't make a whole lot of sense to me unless it's without indexing all of the vectors or some other set of fields; I'm not sure where that number is coming from. But you may be able to get more information by using the disk usage api.
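
For reference, that call looks like this (the run_expensive_tasks flag is required because the analysis is fairly costly; index name taken from this thread):

POST /file_flat_1024/_disk_usage?run_expensive_tasks=true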

As for your questions:

2.1 Does the reranking happen automatically when we index new documents?

No. Reranking happens at query time: once the set of candidates has been retrieved, they are subsequently reranked. Doing reranking in any way prior to this would be extremely expensive.

2.2 What do the quantization improvements include?

Quantization as a process allows more of the vectors to fit into an HNSW structure in memory, so the more quantization you can tolerate, the less RAM you need to make your vector queries efficient. For instance, we recently did a lot of work here to further improve the compression ratio with BBQ, which you mentioned, and which is a form of advanced scalar quantization. In our experiments I've seen it maintain about the same quality as int8 but at a 32x compression ratio rather than just 4x. I highly recommend it, but it's more about reducing RAM than disk.
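
If you want to experiment with that, the quantization type is selected per field via index_options in the mapping. A sketch only, using a hypothetical index name and assuming a version where bbq_hnsw is available:

PUT /file_flat_1024_bbq
{
  "mappings": {
    "properties": {
      "file_section_embedding": {
        "type": "dense_vector",
        "dims": 1024,
        "index": true,
        "index_options": { "type": "bbq_hnsw" }
      }
    }
  }
}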

2.3 Does it mean that we could still reindex even after excluding the dense vector field from _source?

No. If you do not have _source enabled or use something like synthetic source for that field, you won't be able to easily reindex it and would have to repopulate it from some kind of external source system.

2.4 Does it mean that we could still use the raw vector values for rescoring?

I believe _source has no impact on rescoring. The raw vectors can be used for rescoring independent of that.
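
For example, one common pattern is to oversample with the quantized index and then rescore the top candidates against the raw vectors via a script_score rescorer. A sketch only; the query vector is truncated here and would really be the full 1024-dim embedding:

POST /file_flat_1024/_search
{
  "size": 10,
  "knn": {
    "field": "file_section_embedding",
    "query_vector": [0.12, -0.04, 0.33],
    "k": 50,
    "num_candidates": 500
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "script_score": {
          "query": { "match_all": {} },
          "script": {
            "source": "cosineSimilarity(params.query_vector, 'file_section_embedding') + 1.0",
            "params": { "query_vector": [0.12, -0.04, 0.33] }
          }
        }
      }
    }
  }
}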

2.5 Could we export all the raw vectors from an index that contains a large number of vectors, like 40GB or even more in our case?

If you have _source enabled then yes. We've talked about exposing this through Lucene but don't have a great way of doing so yet, so as of right now, without _source or synthetic source, you can't do this. Without knowing the full details myself, synthetic source isn't something that just grabs the raw vectors from Lucene and rebuilds a _source field; it's a little more complex than that, and directly exposing what's stored in Lucene is non-trivial. There might be a way to script this, but it's definitely not supported or documented.
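
To be clear, when _source (or synthetic source) is available, the export itself is just a paged search over that field; a sketch with the scroll API (search_after with a point-in-time would also work):

POST /file_flat_1024/_search?scroll=5m
{
  "size": 1000,
  "_source": ["file_section_embedding"],
  "sort": ["_doc"],
  "query": { "match_all": {} }
}

POST /_search/scroll
{
  "scroll": "5m",
  "scroll_id": "<scroll_id from the previous response>"
}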

Happy to answer more questions though if you have any.

Hi John,

Thanks a lot for your quick reply, I really appreciate your answers.

I actually got the _source disk usage of 163.5GB and 5.7GB (with the dense vector field "file_section_embedding" included in and excluded from _source, respectively) via the disk usage api.

I am using the docker-compose setup of Elasticsearch following the official tutorial at Install Elasticsearch with Docker | Elasticsearch Guide [8.17] | Elastic, and configured 2 shards and 1 replica for the index. Based on this setting, I expect 2 primary shards and 2 replica shards to be created, which seems to be confirmed by the following partial output from the index stats api:

"_shards": {
        "total": 4,
        "successful": 4,
        "failed": 0
    },
    "_all": {
        "primaries": {
            "docs": {
                "count": 10376000,
                "deleted": 0,
                "total_size_in_bytes": 91095876995
            },
            "shard_stats": {
                "total_count": 2
            },
......

Based on my understanding of your comments, the raw values of the dense vector field are stored in Lucene and again in _source if not excluded, so there is a duplication, and this resulted in the additional ~40 GB of storage on disk, right? How does the configuration of 1 replica additionally influence the storage?

What about the quantized vectors, will they be duplicated as well in replica shards?

I have attached part of the response from the disk usage api:

{
    "_shards": {
        "total": 2,
        "successful": 2,
        "failed": 0
    },
    "file_flat_1024": {
        "store_size": "216.5gb",
        "store_size_in_bytes": 232469055909,
        "all_fields": {
            "total": "216.4gb",
            "total_in_bytes": 232444165546,
            "inverted_index": {
                "total": "2.9gb",
                "total_in_bytes": 3147344422
            },
            "stored_fields": "163.7gb",
            "stored_fields_in_bytes": 175830555711,
            "doc_values": "97.1mb",
            "doc_values_in_bytes": 101829561,
            "points": "92.2mb",
            "points_in_bytes": 96760144,
            "norms": "9.8mb",
            "norms_in_bytes": 10375998,
            "term_vectors": "0b",
            "term_vectors_in_bytes": 0,
            "knn_vectors": "49.5gb",
            "knn_vectors_in_bytes": 53257299710
        },
        "fields": {
            "_source": {
                "total": "163.5gb",
                "total_in_bytes": 175628654805,
                "inverted_index": {
                    "total": "0b",
                    "total_in_bytes": 0
                },
                "stored_fields": "163.5gb",
                "stored_fields_in_bytes": 175628654805,
                "doc_values": "0b",
                "doc_values_in_bytes": 0,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0,
                "knn_vectors": "0b",
                "knn_vectors_in_bytes": 0
            },
            "file_section_embedding": {
                "total": "49.5gb",
                "total_in_bytes": 53257299710,
                "inverted_index": {
                    "total": "0b",
                    "total_in_bytes": 0
                },
                "stored_fields": "0b",
                "stored_fields_in_bytes": 0,
                "doc_values": "0b",
                "doc_values_in_bytes": 0,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0,
                "knn_vectors": "49.5gb",
                "knn_vectors_in_bytes": 53257299710
            }
        }
    }
}

Btw, from the response of the disk usage api, it seems that it returns the disk usage of the primary shards only, right?

Synthetic _source is a paid (Enterprise license) feature from 8.17+. That change has been discussed here a bit recently.

Currently it is footnote #15.

Based on my understanding of your comments, the raw values of the dense vector field are stored in Lucene and again in _source if not excluded, so there is a duplication,

Yes, correct, there is duplication.

and this resulted in the additional ~40 GB of storage on disk, right?

Yes, that's correct as well.

How does the configuration of 1 replica additionally influence the storage?

What about the quantized vectors, will they be duplicated as well in replica shards?

I believe it doubles the storage requirements for everything, as the replica needs to be able to take over in the event the primary node goes down. That is, a query to the replica needs to be able to return the same answer as the primary.
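
One way to see that directly is to list the shard copies and their sizes, e.g.:

GET _cat/shards/file_flat_1024?v&h=index,shard,prirep,state,docs,store

Each primary (prirep = p) should show roughly the same store size as its replica (prirep = r).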

Btw, from the response of the disk usage api, it seems that it returns the disk usage of the primary shards only, right?

I believe what it does is wait for at least one copy of each shard to respond, so it could be the replica or the primary, but suffice it to say that in your case they should be identical. And yes, it should be just one copy per shard; in your case it looks like your 2 primary shards are the ones that responded.
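
If you want to double-check which total you're looking at, the cat indices API reports both the primary-only and the overall (primaries + replicas) store size, e.g.:

GET _cat/indices/file_flat_1024?v&h=index,pri,rep,docs.count,pri.store.size,store.size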

The output is interesting, if somewhat cryptic, to think through. What that's telling us is that there's 216.5GB total being used by the 2 primary shards you have set up. Each primary will have 1 replica, for a total of 4 shards and 2 copies of the data spread evenly across all 4 shards.

Let's walk through your disk analyzer output here. I think it makes sense:

So you have:

raw vectors bytes = 10376000 vectors * 4 bytes (float) * 1024 dimensions 
                  = 53257299710
                  ~ 49.60gb

which aligns with the output of the disk analyzer for the file_section_embedding field

In addition to that you will have the quantized representation:

quantized vectors bytes = 10376000 vectors * 1 byte (int8) * 1024
                        = 10625024000
                        ~ 9.90gb

In addition to that you will have the _source stored for the vectors:

_source vectors bytes = 10376000 vectors * 4 bytes (float) * 1024 dimensions 
                      = 53257299710
                      ~ 49.60gb

So between your 2 primary shards you will have a total of (approximately):

total field bytes = 53257299710 + 10625024000 + 53257299710
                  = 117139623420
                  ~ 109gb

That all gets doubled onto your 1 replica so:

replica + primary = 117139623420 * 2
                  = 234279246840
                  ~ 218.19gb

The math doesn't quite work out (there are probably some missing pieces I'm not accounting for), but roughly speaking it's pretty close: 218.19gb vs the 216.5gb reported by the disk analyzer.

Hi John, thanks a lot for your detailed analysis.

Actually, from the index management view in Kibana, the total usage is double the 216.5gb.

So, it seems that the 216.5gb should be for only the primary shards or only the replica shards.

Oh interesting, I haven't used this disk analyzer api much. Gosh, I'm not sure what the diff is there.

Let me go poke around a bit and see if I can figure out why the numbers don't make sense.

It looks to me like the assumption of how _source is stored is not correct. The calc for raw vector bytes and quantized vector bytes tallies with the output from the disk usage api, but not so the _source calculation.

Hi @john-wagster and @RainTown,

thanks a lot for your replies.

The following is a partial disk usage api response after excluding the dense vector field from _source:

{
    "_shards": {
        "total": 2,
        "successful": 2,
        "failed": 0
    },
    "sdx_file_flat_1024_exclude_vec_3": {
        "store_size": "63.6gb",
        "store_size_in_bytes": 68343021736,
        "all_fields": {
            "total": "63.6gb",
            "total_in_bytes": 68327502483,
            "inverted_index": {
                "total": "2.9gb",
                "total_in_bytes": 3133941756
            },
            "stored_fields": "10.9gb",
            "stored_fields_in_bytes": 11706725438,
            "doc_values": "97.9mb",
            "doc_values_in_bytes": 102683766,
            "points": "90mb",
            "points_in_bytes": 94450574,
            "norms": "9.8mb",
            "norms_in_bytes": 10375997,
            "term_vectors": "0b",
            "term_vectors_in_bytes": 0,
            "knn_vectors": "49.6gb",
            "knn_vectors_in_bytes": 53279324952
        },
        "fields": {
            "_recovery_source": {
                "total": "5gb",
                "total_in_bytes": 5447105611,
                "inverted_index": {
                    "total": "0b",
                    "total_in_bytes": 0
                },
                "stored_fields": "5gb",
                "stored_fields_in_bytes": 5446786083,
                "doc_values": "312kb",
                "doc_values_in_bytes": 319528,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0,
                "knn_vectors": "0b",
                "knn_vectors_in_bytes": 0
            },
            "_source": {
                "total": "5.7gb",
                "total_in_bytes": 6160547735,
                "inverted_index": {
                    "total": "0b",
                    "total_in_bytes": 0
                },
                "stored_fields": "5.7gb",
                "stored_fields_in_bytes": 6160547735,
                "doc_values": "0b",
                "doc_values_in_bytes": 0,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0,
                "knn_vectors": "0b",
                "knn_vectors_in_bytes": 0
            },
            "file_section_embedding": {
                "total": "49.6gb",
                "total_in_bytes": 53279324952,
                "inverted_index": {
                    "total": "0b",
                    "total_in_bytes": 0
                },
                "stored_fields": "0b",
                "stored_fields_in_bytes": 0,
                "doc_values": "0b",
                "doc_values_in_bytes": 0,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0,
                "knn_vectors": "49.6gb",
                "knn_vectors_in_bytes": 53279324952
            }
        }
    }
}

You can see that there is quite a large drop in the storage size of the _source field.

What does your mapping look like? Is it only one field, or is it multiple fields including that dense_vector field?

I ask just because I wouldn't have expected anything in _source if you are excluding the dense_vector field and it's the only one.

Hi @john-wagster, you can reference the 1st mapping that I posted in Dense vector field in nested object - Elastic Stack / Elasticsearch - Discuss the Elastic Stack, which is what I used to produce the index testing results here. The file_section_text field should consume the most storage after the file_section_embedding field.

I went back and read some of the docs as well on _source and tuning for disk usage.

I'm not sure I have a good explanation for your large number in Kibana, but seeing your mapping may help. It might also help to see some example docs you've indexed. Digging into _source will probably better explain some of what you are seeing, though.

The _source field is essentially a blob of the JSON that was passed as part of the request when indexing a document, not the actual indexed data. By default it gets LZ4 compressed (although you can configure further compression), so for most regular data you might expect _source to be fairly compressible. However, the vector data probably looks largely random to LZ4, so it likely costs at least 2x the actual stored vectors. It's worth noting that this means it's not floats or vectors that are getting stored, and my back-of-the-napkin math earlier of 1024 * docs * 4 is incorrect. The best way to estimate _source would be to actually count the total number of characters in the indexed JSON and LZ4 compress it. For instance, a float represented in the request source as "1.1" is compressed and stored as 3 characters, which is very different from a float like "1.1239898123", which is compressed and stored as 12 characters, despite both of them being processed and indexed as 4-byte floats outside of _source. This also means keywords end up at a largely intuitive size, but numbers in general are much harder to napkin math; floating point numbers in particular may vary a lot in _source.

I went ahead and tried a few simple experiments with _source myself, indexing a single document with only a single integer field, and found pretty similar figures for simple documents that I manually compressed with LZ4 like this: lz4 -c <<< '{"my_field":1}' | wc -c. I also tried doing this with vector fields. Suffice it to say, what I learned is that it can be very difficult to ascertain how large _source will actually be; it has more to do with the number of bytes in the original request and is influenced by things like whitespace. You can change the compression type to ZSTD, but honestly I would not recommend exploring this unless you absolutely need _source for your vector data. I know in the past I've said _source roughly doubles your total data storage, but I've learned this really only works as rough napkin math at best. The docs I linked above do a good job of walking through the pros / cons of using and configuring _source.
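
For completeness, the stored-fields codec is a static, index-level setting (set when the index is created); a sketch only, with a hypothetical index name, and not a recommendation:

PUT /file_flat_1024_compressed
{
  "settings": {
    "index.codec": "best_compression"
  }
}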

Hi @john-wagster, you can reference the 1st mapping that I posted in Dense vector field in nested object - Elastic Stack / Elasticsearch - Discuss the Elastic Stack, which is what I used to produce the index testing results here. The file_section_text field should consume the most storage after the file_section_embedding field.

Ah, ok, so with those mappings it makes more sense. If you turn off _source for the dense vector field, you still have some additional fields that are contributing.

Hi @john-wagster, I think the above-mentioned detail might explain why the _source field was reduced from 163.5gb to 5.7gb after excluding the dense vector field from _source. In our text embedding vectors, each value has roughly 16 decimal digits of precision, like -0.00405535101890564. If they are stored as characters, at a single byte per character, then this would be (1024*16*10376000)/(1024*1024*1024) ~= 158gb of storage.

@yli

Is there really any value in even sending that level of precision in the dense vector to Elastic, if it's going to be squashed (maybe not the right word) into 4-byte floats? What is effectively random noise just occupies (a lot of) space in _source. If you pre-truncate before indexing, what would you be losing?

Hi @RainTown, thanks for pointing it out. We were thinking about using the stored raw vectors, output from our embedding model, in _source for rescoring purposes.

But such precision might not really be necessary; we would need to evaluate the search results again if we reduce the precision to save storage space.

You are not reducing precision in the dense vector; that has obviously already happened at index time.

This isn't really my area, so apologies for butting in (though I do have an (old) master's degree in neural networks), but I'd be surprised if some sort of rescoring based on the "fuller" precision vectors in _source made an objectively significant quality improvement in search results. What you might get are different results, but an objectively significant quality improvement? Of course, it depends on whether the "fuller" precision really contains more "signal", or just more significant digits of the "noise".

I certainly could be surprised, of course, if the evidence supports it, so I would be interested in the results if you did run the comparison.

@RainTown I actually meant the search results after rescoring based on the values from _source.

I will find time to do the comparison. I am not sure how much this will influence the quality, but hopefully it will keep the same quality level.

You need to balance quality (how you measure and estimate error there is an interesting side street we don't need to go down) against cost, both the storage cost and the other (total) costs of the rescore. My hunch, only a hunch, no evidence, is that it won't be worth it: barely measurable quality differences for the extra cost. YMMV.

Paying for the license to have synthetic source won't help you (much) here, as that reconstructs the _source from the (now trimmed) dense_vector. Yep, a little test with 8.16 (no license needed pre-8.17) confirmed that. So you save some disk space, but all the other synthetic source downsides still apply.
