Dense Vector Field Extremely Large

Hi all,

I've been experimenting with applying both forms of compression to our dense vectors and comparing performance. While bbq_hnsw has been performing relatively well at 1-3s per query average, int8 has been extremely slow with roughly 8-10s+ per knn query.

I dug into the disk usage of both vector fields (each holds the same 10M vectorized images at 512 dimensions), thinking maybe the fields are bigger than I expected and I didn't give the pod hosting it enough memory to keep them all in memory. For 10M 512-dimension assets the BBQ-compressed field looks about right in size (with a rescore of 3 built in), but the int8 field is just off the chart. It looks like the combined total of the uncompressed and compressed vectors.

"image_vector_bbq": {
"total": "1.3gb",
"total_in_bytes": 1443088252,
"knn_vectors": "1.3gb",
"knn_vectors_in_bytes": 1443088252
},
"image_vector_int8": {
"total": "26.4gb",
"total_in_bytes": 28400884275,
"knn_vectors": "26.4gb",
"knn_vectors_in_bytes": 28400884275
},

Is this normal behavior? I couldn't find any way to break this number down further in the documentation, but off the top of my head, since it happens for only one field and not the other, I'm leaning towards "not normal". I'm just looking to see whether the reason it's so slow is that it is, for whatever reason, trying to load 26.4GB into memory when only 10GB is allocated to the application with a 5GB heap. I do realize the heap size may need to inch up a bit, as the estimated combined size of both compressed vector fields is likely in the 6-7GB range. Regardless, my understanding was that the non-compressed vectors are stored on disk and inflate the _source size, but are not stored within the int8-compressed vector field itself.

For extra context, our kNN search is using a k of 30, a num_candidates of 200, and a rescore of 20. That's definitely bigger than the k=8-10 I've seen around, so that's also a possible cause of the slowdown, but I'm trying to rule out any core architecture problems before modifying the query.

@nicky welcome to the forums!

1-3s per query average, int8 has been extremely slow with roughly 8-10s+ per knn query

That seems slow to me in general. We should dig into that some more. Can you share the mappings you have for those vector fields and maybe a little more information about your k8s setup? I'm curious what kind of disk I/O and CPU you have here. Sounds like 10GB of RAM per pod?

"image_vector_bbq": {
"total": "1.3gb",

this makes sense to me, but let's break it down so you have the intuition for the math. BBQ compresses each vector to 1 bit per dimension plus 14 bytes (3 floats and a short) of corrective factors.

So we have:

10000000*(512/8 + 14) = 0.78GB

And then we have the HNSW structure itself (mostly a bunch of pointers), which should roughly be:

(12*4)*10000000 = 0.48GB

Total then:

~1.26GB

Looks pretty close (close enough for horseshoes and hand grenades, anyway).

"image_vector_int8": {
"total": "26.4gb"

Then for int8 let’s see what that looks like:

For the int8-compressed vectors themselves (1 byte per dimension) we have:

10000000*512 = 5.2GB

and then for the HNSW graph:

(12*4)*10000000 = 0.48GB

total then:

~5.68GB
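
If it helps, here's the same back-of-the-envelope math as a small Python sketch (same assumptions as above: 1 bit per dimension plus 14 bytes for BBQ, 1 byte per dimension for int8, and roughly 48 bytes per vector of HNSW graph overhead), so you can plug in your own numbers:

num_vectors = 10_000_000
dims = 512

bbq_bytes = num_vectors * (dims / 8 + 14)  # quantized vectors: 1 bit/dim + 14 bytes of corrective factors
int8_bytes = num_vectors * dims            # quantized vectors: 1 byte/dim
hnsw_bytes = num_vectors * 12 * 4          # rough per-vector graph/pointer overhead

print(f"bbq_hnsw  ~= {(bbq_bytes + hnsw_bytes) / 1e9:.2f} GB")   # ~1.26 GB
print(f"int8_hnsw ~= {(int8_bytes + hnsw_bytes) / 1e9:.2f} GB")  # ~5.60 GB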

So what you have definitely seems off to me too, by about 20GB, which just happens to be about the size of the raw float32 vectors in this case (10M * 512 dims * 4 bytes ≈ 20.5GB). That might explain the slowness: there's a lot being loaded into RAM, maybe because it's in _source. When HNSW doesn't have enough RAM the algorithm falls off a performance cliff. (Funny enough, we are just about to launch an algorithm called bbq_disk that deals with that performance cliff, but I digress.) So my guess is something is off with your config on the int8 mapping, but honestly I'm not entirely sure what that might be right off.

Hopefully looking at the config will help. And/or just seeing the expected math might be sufficient for you to see something obvious. Something tells me if we solve the sizing the slowness will make sense too. Either way let me know and happy to iterate with you on it.

Thanks a bunch John!

We run a few things per pod related to ES, but the main ES container is allocated 10GB of RAM per pod. We run it on an n2-standard-8 equivalent and give it a limit of 4000m CPU (milliCPU is the most bizarre measurement to wrap my head around). Storage is a Google persistent disk SSD that we have capped at nearly 2.5x the current total index size. We also set the JVM arguments to "-Xms5g -Xmx5g" via ES_JAVA_OPTS, which I still think may need to be upped, because even if it were only attempting to load the ~7GB of compressed vectors, that seems like too little.

Yeah, I definitely felt like the int8 vector field should not be including the raw vectors. My assumption is that whatever is reported under the int8 field in the _disk_usage analysis is what it will attempt to load into memory, which would definitely cause slow searches. I wanted to make sure I wasn't misunderstanding what I was seeing in the analysis, and that it isn't a case where it reports 26.4GB but only actually tries to load the ~5GB of compressed vectors, with the other ~21GB being the full vectors that live solely on disk.

Anyhow, here are our mappings for these:

                            "image_vector_bbq": {
                                "type": "dense_vector",
                                "dims": 512,
                                "index": True,
                                "index_options": {
                                    "type": "bbq_hnsw",
                                },
                            },
                            "image_vector_int8": {
                                "type": "dense_vector",
                                "dims": 512,
                                "index": True,
                                "index_options": {
                                    "type": "int8_hnsw",
                                },
                            },

As for insertion, we vectorize an image and then assign that same uncompressed vector to both "image_vector_bbq" and "image_vector_int8" before inserting the document into the index.
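
In case it helps, here's a simplified sketch of what that insertion looks like on our side (the endpoint, index name, ids, and the embedding function are stand-ins):

import random
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def embed_image(asset):
    # stand-in for our real image model; returns a 512-dim float vector
    return [random.random() for _ in range(512)]

def actions(assets, index="assets"):  # index name is illustrative
    for asset_id, asset in assets:
        vector = embed_image(asset)
        yield {
            "_index": index,
            "_id": asset_id,
            "_source": {
                # the same uncompressed vector goes into both fields;
                # each field quantizes it independently at index time
                "image_vector_bbq": vector,
                "image_vector_int8": vector,
            },
        }

# helpers.bulk(es, actions(my_assets))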

Even if this doesn't provide perfect clarity into what exactly is causing the issue, can you confirm it's a case where the int8 field in the index reports 26.4GB, so it attempts to load 26.4GB into memory, fails, and falls back to SSD storage, which throws performance down the drain? Again, thank you so much for the helpful explanation and walkthrough of what we should be expecting; even knowing that this is a problem is very helpful!

Hmm, the configs look OK. What version of ES are you running? I don't remember offhand, but I wonder if there was a bug in the computation of how much disk is being used. Have you tried excluding those fields from _source as well:

  "mappings": {
    "_source": {
      "excludes": [
        "image_vector_bbq",
        "image_vector_int8"
      ]
    }
  }

I’d be curious if that changes the output of the _disk_usage api. I’ll go do a quick test myself too and see if that stuff all outputs expected values on the latest ES version.

I do bet you're memory constrained either way for int8. If you can, you might try a slightly bigger machine with more RAM, or for testing drop to about half of your data, and see if that helps. I'd be curious if that greatly improves performance there.

k of 30, a num_candidates of 200, and a rescore of 20

I missed some of this on my initial read too. k=30 seems fine to me. num_candidates=200 also seems fine, with the caveat that you may find you can and should tune it differently for each algorithm. The same is true of rescore, which I'm assuming in this case is the oversample param (if you drop your query config here we can iterate on that too). 20 for oversample seems really high to me. I would expect int8 to not need it at all, and it may be a source of slowness there. I would expect bbq to definitely need it, but probably not at 20. Curious if y'all have experimented with a smaller value there yet. I bet that's a large source of slowness. It's data dependent, but usually I try to hit a consistent recall / NDCG against some golden set rather than keeping those params the same across algorithms for comparison. The reason is that int8 typically isn't actually lossy; a lot of models just have too large a vector space. bbq purposefully compresses into lossy territory, but it is so fast at distance computations that we can often do much larger num_candidates and oversample exploration for better results in both query time and recall.
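
For reference, here's a rough sketch of what a knn search with those knobs looks like through the Python client (assuming your "rescore" maps to the knn rescore_vector.oversample option; the endpoint, index name, and query vector below are placeholders):

import random
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")           # placeholder endpoint
query_vector = [random.random() for _ in range(512)]  # stand-in for a real image embedding

resp = es.search(
    index="assets",                           # index name is illustrative
    knn={
        "field": "image_vector_bbq",          # or "image_vector_int8"
        "query_vector": query_vector,
        "k": 30,
        "num_candidates": 200,                # worth tuning separately per algo
        "rescore_vector": {"oversample": 20}, # likely much lower (or omitted) for int8
    },
    source=False,                             # ids/scores are enough; skip _source
)
print([(hit["_id"], hit["_score"]) for hit in resp["hits"]["hits"]])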

We are on Elastic 9.1.0.

Also, is it safe to remove them from _source? I was under the impression that they should remain in the source for searchability's sake, but maybe I'm misunderstanding how they need to be loaded. I also thought that once removed, you'd have to reindex to get them back if they're ever needed; I might be wrong on that too.

On our dev pods we have ~800,000 assets, and int8 is quite snappy there, returning in about the same time as BBQ.

As for our query, yeah, we're just testing the waters with that one for now, and we found that BBQ with that query has performed pretty well on both the 800k pod and the 10M pod. It definitely seems like overkill for int8, and dropping k to 10, num_candidates to 50-100, and rescore to 5 brings the average int8 query time down to around 2-3s, which is much more manageable. However, this heavy query runs just fine for int8 on the smaller dev server, so maybe it is a RAM thing.

In the meantime I will bump the RAM and heap size for the instance by a couple GB and see if that helps, since it should then (if the extra 20GB is just a reporting issue) be able to hold the compressed vectors and the HNSW graph in memory.

You are not wrong. However, it's safe to remove them from _source for the sake of things like rescoring, and it's what I'd recommend. In fact, in subsequent versions we are going to remove them from _source by default. In 9.1.0, the only thing you lose by excluding them is the ability to reindex from source. Interestingly, storing them in _source actually double-stores the "raw" representations: we already keep a copy in Lucene at index time that is used for rescoring, and that is separate from _source. In subsequent versions we'll reconstitute the vectors for reindexing from that raw copy we already have on disk. Relevant PR (in case you want to learn more): Enable `exclude_source_vectors` by default for new indices by jimczi · Pull Request #131907 · elastic/elasticsearch · GitHub. It should save a ton of space. The only reason to ever store vectors in _source is if you have a real need to get back exactly what the vector was when you loaded it, at the cost of storing that exact representation.


OK, just to make sure before acting on this: the heap, if set manually at all, should still be 50% of available memory in this case? I was going to bump ES memory to 14GB and give the JVM a heap of "-Xms7g -Xmx7g", as that should theoretically fit the vectors.
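
Quick sanity check on that plan, using the estimates from earlier in the thread (a rough sketch; it assumes roughly everything outside the heap is available for the compressed vectors and graphs):

container_ram_gb = 14
heap_gb = 7                              # -Xms7g -Xmx7g
off_heap_gb = container_ram_gb - heap_gb

# compressed vectors + HNSW graphs for both fields, from the math above
needed_gb = 0.78 + 0.48 + 5.12 + 0.48    # bbq vectors + graph + int8 vectors + graph

print(off_heap_gb, round(needed_gb, 2))  # 7 vs ~6.86 -> tight, but should roughly fit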

The _source mapping is a valid point then, but I may save that for a future reindex, as it can take over a week to reindex everything. Another reindex will inevitably come, so good shout for when we do!

So I ran my own local experiments to help give a baseline. I started out expecting this to be quick and wound up digging into the code to validate what I was seeing. Honestly, I think I just wanted to better understand the relationship between the _disk_usage API and what I understand to be the RAM needed for HNSW. I ran on 9.1.0 with source kept, but I didn't see any difference against main; the numbers below are from main. I believe the math lines up and is a better explanation than I previously provided.

So here’s the mapping I created:

curl -XPUT --header 'Content-Type: application/json' "http://localhost:9200/test" -d '{
  "mappings": {
    "properties": {
       "image-vector": {
        "type": "dense_vector",
        "dims": 64,
        "similarity": "l2_norm",
        "index": true,
        "index_options": {
          "type": "int8_hnsw"
        }
      }
    }
  }
}'

And here’s loading 10k vectors at 64 dims:

seq 1 10000 | xargs -I % -P1 curl -XPOST --header 'Content-Type: application/json' "http://localhost:9200/test/_doc" -d "
    { \"image-vector\": $(python -c 'import numpy as np; print(np.random.random(64).tolist())') }
"

And here’s running the disk_usage api:

curl -XPOST --header 'Content-Type: application/json' "http://localhost:9200/test/_disk_usage?run_expensive_tasks=true" -d '' | python -mjson.tool

with output:

{
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "test": {
        "store_size": "3.3mb",
        "store_size_in_bytes": 3557605,
        "all_fields": {
            "total": "3.3mb",
            "total_in_bytes": 3517444,
            "inverted_index": {
                "total": "180.6kb",
                "total_in_bytes": 184956
            },
            "stored_fields": "41.7kb",
            "stored_fields_in_bytes": 42780,
            "doc_values": "14.6kb",
            "doc_values_in_bytes": 14994,
            "points": "12.6kb",
            "points_in_bytes": 13002,
            "norms": "0b",
            "norms_in_bytes": 0,
            "term_vectors": "0b",
            "term_vectors_in_bytes": 0,
            "knn_vectors": "3.1mb",
            "knn_vectors_in_bytes": 3261712
        },
        "fields": {
            ...
            "image-vector": {
                "total": "3.1mb",
                "total_in_bytes": 3261712,
                "inverted_index": {
                    "total": "0b",
                    "total_in_bytes": 0
                },
                "stored_fields": "0b",
                "stored_fields_in_bytes": 0,
                "doc_values": "0b",
                "doc_values_in_bytes": 0,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0,
                "knn_vectors": "3.1mb",
                "knn_vectors_in_bytes": 3261712
            }
        }
    }
}

What matters here is test.fields.image-vector.knn_vectors_in_bytes, which is made up of the quantized vectors, the HNSW index, and the raw vectors. I found the HNSW index is a pretty small chunk in reality; for me it was more like vector count * 16, because at this number of vectors really only one layer matters from a sizing perspective. That's much smaller than the vector count * 48 I mentioned previously. This makes some sense, because the graph should be roughly 1 byte per connection for m connections, and with fewer vectors I expect fewer vectors on the higher levels. The more vectors, the larger the graph structure.

So doing the math here:

int8 quantized + hnsw index + raw vectors ~= 
10_000 * 64 + 10_000 * 16 + 10_000 * 64 * 4 ~= 3360000

For what you were seeing here’s what I would have expected to see vs what you did see:

bbq quantized + hnsw index + raw vectors ~=
10_000_000 * (512/8+14) + 10_000_000 * 48 + 10_000_000 * 512 * 4 ~= 21740000000
actual = 1443088252

So that should have been ~21GB, not 1.3GB. After going over the numbers, I am beginning to suspect that the number of indexed docs is fewer than 10_000_000.

For int8 what I would expect vs what you were seeing:

int8 quantized + hnsw index + raw vectors ~=
10_000_000 * 512 + 10_000_000 * 48 + 10_000_000 * 512 * 4 ~= 26080000000
actual = 28400884275

If fewer docs were indexed into the bbq_hnsw field, that's about the only explanation I can think of. But I figured I'd explain my reasoning here, and maybe that will help you spot something off on your side, or something obviously wrong in my math. :slight_smile:
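
One quick way to check that would be counting docs per field with an exists query, something like this sketch via the Python client (endpoint and index name are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

for field in ("image_vector_bbq", "image_vector_int8"):
    resp = es.count(index="assets", query={"exists": {"field": field}})
    print(field, resp["count"])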

Doing 50% is generally a good starting place in my opinion. However, the HNSW graph largely sits in off-heap RAM. So while I'm not entirely sure what the optimal ratio would be here, my guess is that lowering the heap will help. You might start with 50%, then lower the heap and see if it improves things or lets you use less memory overall.

I had a thought last night: wait, what if Lucene is deduping the vectors and I just didn't realize it could do that? As in, if you loaded two identical vectors into two separate fields, would we detect that and not store the raw vectors twice?

I tested that too. And as you might expect we don’t dedup those. This is fun though.

mapping

curl -XPUT --header 'Content-Type: application/json' "http://localhost:9200/test" -d '{
  "mappings": {
    "properties": {
      "image-vector": {
        "type": "dense_vector",
        "dims": 64,
        "similarity": "l2_norm",
        "index": true,
        "index_options": {
          "type": "bbq_hnsw"
        }
      },
      "image-vector2": {
        "type": "dense_vector",
        "dims": 64,
        "similarity": "l2_norm",
        "index": true,
        "index_options": {
          "type": "int8_hnsw"
        }
      }      
    }
  }
}'

adding docs:

VECTOR=$(python -c 'import numpy as np; print(np.random.random(64).tolist())');
seq 1 10000 | xargs -I % -P1 curl -XPOST --header 'Content-Type: application/json' "http://localhost:9200/test/_doc" -d "
    { \"image-vector\": $VECTOR,
	  \"image-vector2\": $VECTOR }
"

relevant output of disk_usage:

            "image-vector": {
                "total": "2.6mb",
                "total_in_bytes": 2801631,
                "inverted_index": {
                    "total": "0b",
                    "total_in_bytes": 0
                },
                "stored_fields": "0b",
                "stored_fields_in_bytes": 0,
                "doc_values": "0b",
                "doc_values_in_bytes": 0,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0,
                "knn_vectors": "2.6mb",
                "knn_vectors_in_bytes": 2801631
            },
            "image-vector2": {
                "total": "3.1mb",
                "total_in_bytes": 3261630,
                "inverted_index": {
                    "total": "0b",
                    "total_in_bytes": 0
                },
                "stored_fields": "0b",
                "stored_fields_in_bytes": 0,
                "doc_values": "0b",
                "doc_values_in_bytes": 0,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0,
                "knn_vectors": "3.1mb",
                "knn_vectors_in_bytes": 3261630
            }

math:

# bbq_hnsw
10_000 * (64/8+14) + 10_000 * 16 + 10_000 * 64 * 4 = 2940000

# int8_hnsw
10_000 * 64 + 10_000 * 16 + 10_000 * 64 * 4 = 3360000

Interesting deep dive! Thank you for sharing that; it definitely gave some insight into what's expected. When you mentioned the deduping, I thought surely that was it and explained it all. However, I ran an exists aggregation over both fields:

'{
  "aggs": {
    "vector_counts": {
      "filters": {
        "filters": {
          "has_int8": {
            "exists": { "field": "image_vector_int8" }
          },
          "has_bbq": {
            "exists": { "field": "image_vector_bbq" }
          },
          "has_both": {
            "bool": {
              "must": [
                { "exists": { "field": "image_vector_int8" } },
                { "exists": { "field": "image_vector_bbq" } }
              ]
            }
          }
        }
      }
    }
  }
}'

and got:

{"took":633,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":10000,"relation":"gte"},"max_score":null,"hits":},"aggregations":{"vector_counts":{"buckets":{"has_bbq":{"doc_count":10184491},"has_both":{"doc_count":10184491},"has_int8":{"doc_count":10184491}}}}}

I double-checked an asset we intentionally skipped, and it does not appear in the exists query. It seems both fields have the same count.

For what it's worth, we interact with Elasticsearch via the Python client, in case that makes a difference.

I will also note that adding 4GB of memory off-heap reduced the average time of the int8 query, even with it being as heavy & intense as it is, to like 1-3s rather than 8-10s.

It seems both fields have the same count.

Hmm, yeah, I'm not sure why you see so little for bbq_hnsw. The _disk_usage API is focused on disk space, which of course isn't an exact proxy for memory usage, but it should be close. If I get some time I'll see if there's a bug in that API.

I will also note that adding 4GB of memory off-heap reduced the average time of the int8 query, even with it being as heavy & intense as it is, to like 1-3s rather than 8-10s.

Nice, I'm glad that helped. Let me know if I can help more, or feel free to drop other notes or thoughts here too. I'd be curious how your evaluations pan out. HNSW is pretty picky about having enough memory. The upcoming bbq_disk algo should be a lot easier to deal with in terms of memory usage.