Knn_vectors field understanding

Let me walk you through some of this and hopefully that will help a bit.

Also here's a related discussion you can take a look at: Dense vectors taking up much more space than expected

The way this gets stored is that we use Lucene under the hood to store both the quantized (int8 in this case) representation, which in your case is the 9.9GB of data, AND the raw vectors, which as you accurately computed come to about 39.59GB. The raw vectors are kept for comparison, reranking, and retrieval. At a minimum I would expect to see this disk usage with your model and number of documents, independent of your usage of _source.
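
For a concrete picture, here's roughly the kind of mapping we're talking about (a sketch only: the index name, field name, and dimension count are placeholders, and int8_hnsw is the quantized index type in question):

```
PUT /my-vector-index
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 1024,
        "index": true,
        "similarity": "cosine",
        "index_options": {
          "type": "int8_hnsw"
        }
      }
    }
  }
}
```

Back-of-envelope: the raw copy is roughly num_docs × dims × 4 bytes (float32) and the int8 copy roughly num_docs × dims × 1 byte, which is why your quantized structure lands at about a quarter of the 39.59GB raw size.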

_source adds to this total because it's a separate store of the raw vectors, primarily there to make reindexing easier. I would highly recommend you don't start with this enabled. If you read through the link above you'll see I mention synthetic source, which is a great option as it completely removes the storage component of _source, though I do believe it is (or is transitioning toward being) a paid feature now. So with _source enabled, just from that one field I would expect ~40GB + ~10GB + ~40GB = ~90GB of storage.
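
If you do decide to drop the _source copy for the vector field, here's a sketch of what that looks like (placeholder names again; double-check synthetic source availability and licensing for your version before relying on it):

```
PUT /my-vector-index
{
  "mappings": {
    "_source": {
      "excludes": ["my_vector"]
    },
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 1024,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}
```

Synthetic source would instead be switched on for the whole index (e.g. "_source": { "mode": "synthetic" } in the mappings on versions that support it) and avoids storing the field twice while still letting you reindex.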

The numbers you mentioned seem a little odd to me, but here's my guess. The 163.5GB sounds like roughly double the storage I would expect with _source enabled, which likely means you are using the default of 1 primary shard and 1 replica, so you can expect twice the disk usage. The numbers don't quite line up, but if you dig in you might find that's roughly what you are seeing. The 5.7GB doesn't make a whole lot of sense to me unless it was measured without all of the vectors indexed, or it only covers some other set of fields; I'm not sure where that number is coming from. You may be able to get more information by using the disk usage API (example below).
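
Something like this should break the storage down per field, and the _cat/shards output will confirm the primary/replica doubling (index name is a placeholder):

```
POST /my-vector-index/_disk_usage?run_expensive_tasks=true

GET _cat/shards/my-vector-index?v
```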

As for your questions:

2.1 Does the reranking happen automatically when we index new documents?

No, reranking happens at query time: once the set of candidates has been retrieved, they are subsequently reranked. Doing any reranking prior to this would be extremely expensive.
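
For reference, this is where it happens: a kNN search gathers num_candidates candidates per shard and the top k come back after that reranking step (field name and query vector are placeholders, and the vector is truncated for readability):

```
POST /my-vector-index/_search
{
  "knn": {
    "field": "my_vector",
    "query_vector": [0.12, -0.34, 0.56],
    "k": 10,
    "num_candidates": 100
  }
}
```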

2.2 What do the quantization improvements include?

Quantization as a process allows more of the vectors to fit into an HNSW structure in memory, so the more quantization you can tolerate, the less RAM you need to make your vector queries efficient. For instance, we recently did a lot of work here to push the compression ratio even further with BBQ, which, as you mentioned, is a form of advanced scalar quantization. In our experiments I've seen it maintain about the same quality as int8 but at a 32x compression ratio rather than just 4x. Highly recommend it, but it's more about reducing RAM than disk.
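
If you want to try it, BBQ is just another index_options type on the same field type (a sketch, assuming you're on a version recent enough to support bbq_hnsw):

```
PUT /my-bbq-index
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "dims": 1024,
        "index": true,
        "similarity": "cosine",
        "index_options": {
          "type": "bbq_hnsw"
        }
      }
    }
  }
}
```

The 32x figure is about what has to sit in RAM for the HNSW graph to stay fast; the raw float32 vectors are still kept on disk, so total disk usage doesn't shrink anywhere near as much.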

2.3 Does it mean that we could still reindex even when excluding the dense vector field from _source?

No. If you do not have _source enabled (or something like synthetic source) for that field, you won't be able to easily reindex it and would have to repopulate it from some kind of external source system.
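
That's because _reindex just reads each document's _source from the old index and writes it to the new one; if the vectors aren't in _source, there's nothing for it to copy (index names hypothetical):

```
POST _reindex
{
  "source": {
    "index": "my-vector-index"
  },
  "dest": {
    "index": "my-vector-index-v2"
  }
}
```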

2.4 Does it mean that we could still use the raw vector values for rescore?

I believe _source has no impact on rescoring. The raw vectors can be used for rescoring independently of it.
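
One way to do that today (a sketch, not the only approach; names are placeholders and the query vector is truncated) is to oversample with kNN and then rescore the top window with an exact script_score over the raw vectors:

```
POST /my-vector-index/_search
{
  "knn": {
    "field": "my_vector",
    "query_vector": [0.12, -0.34, 0.56],
    "k": 100,
    "num_candidates": 200
  },
  "rescore": {
    "window_size": 100,
    "query": {
      "rescore_query": {
        "script_score": {
          "query": { "match_all": {} },
          "script": {
            "source": "cosineSimilarity(params.qv, 'my_vector') + 1.0",
            "params": { "qv": [0.12, -0.34, 0.56] }
          }
        }
      }
    }
  }
}
```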

2.5 Could we export all the raw vectors from an index which contains a large amount of vectors? Like 40GB or even more in our case.

If you have _source enabled then yes. We've talked about exposing this through Lucene but don't have a great way of doing so yet, so right now you can't do this without _source or synthetic source. Without knowing the full details myself, synthetic source isn't something that just grabs the raw vectors from Lucene and rebuilds a _source field; it's a little more complex than that, and directly exposing what's stored in Lucene is non-trivial. There might be a way to script this, but it's definitely not supported or documented.
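
With _source (or synthetic source) in place, exporting is just a matter of paging through the index and pulling back only the vector field, something along these lines (placeholder names; you'd page with search_after over a point-in-time, or the scroll API, for a full export):

```
POST /my-vector-index/_search
{
  "size": 1000,
  "_source": ["my_vector"],
  "sort": ["_doc"],
  "query": { "match_all": {} }
}
```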

Happy to answer more questions though if you have any.