I'm building out a vector search application with 384-dimensional vectors. So far I have about 3 million documents in my Elasticsearch index, using a dense_vector field. There are currently 8 shards (I'm expecting to load about 80M documents). When I run a search using the _knn_search endpoint, it takes about 4s total. Is there something I can do to speed it up besides just throwing more computing resources at it? This is a side project running on a single server (4 GB RAM, 2 vCPUs), and I don't want to build out a huge cluster for the time being.
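For context, my searches look roughly like this (a minimal sketch; the index name, field name, and parameters are placeholders, and the real query vectors come from my embedding model):

```python
import random

import requests

# Placeholder: real query vectors come from my embedding model.
query_vector = [random.random() for _ in range(384)]

body = {
    "knn": {
        "field": "embedding",          # my dense_vector field
        "query_vector": query_vector,
        "k": 10,
        "num_candidates": 100,
    },
    "_source": ["title"],
}
resp = requests.get("http://localhost:9200/documents/_knn_search", json=body)
resp.raise_for_status()
print(resp.json()["took"], "ms reported by Elasticsearch")
```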
Hello @jalustig, 4 seconds definitely seems a bit slow for that data scale. Here are some notes that might help:
- If you only have 2 vCPUs and a single machine, it only makes sense to have at most 2 shards. During search, Elasticsearch uses 1 thread per shard, and with 2 vCPUs only 2 shards can be searched in parallel. Since you have 8 shards, the remaining shards are queued and searched sequentially, which can slow down the search.
- It's important to refresh the index before beginning searches. Otherwise the kNN search can be blocked waiting for the buffered index requests to complete. This could make the first kNN search request quite slow.
- It can also help to force merge the index to a single segment before searching. These docs give more background: k-nearest neighbor (kNN) search | Elasticsearch Guide [8.2] | Elastic. A rough sketch of these steps follows this list.
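To make the shard-count and force-merge points concrete, here's a rough sketch rather than a drop-in recipe (index names, host, and the mapping details like `similarity` are placeholders): reindex into a 2-shard index, then refresh and force merge it before searching.

```python
import requests

ES = "http://localhost:9200"

# 1. Create a destination index with 2 primary shards, matching the 2 vCPUs.
requests.put(
    f"{ES}/documents-2shards",
    json={
        "settings": {"number_of_shards": 2, "number_of_replicas": 0},
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "dense_vector",
                    "dims": 384,
                    "index": True,
                    "similarity": "cosine",  # use whatever similarity your field already uses
                }
            }
        },
    },
).raise_for_status()

# 2. Copy the existing documents over (runs as a background task).
requests.post(
    f"{ES}/_reindex?wait_for_completion=false",
    json={"source": {"index": "documents"}, "dest": {"index": "documents-2shards"}},
).raise_for_status()

# 3. Once reindexing has finished: make buffered docs searchable, then merge
#    each shard down to a single segment before running kNN searches.
requests.post(f"{ES}/documents-2shards/_refresh").raise_for_status()
requests.post(f"{ES}/documents-2shards/_forcemerge?max_num_segments=1").raise_for_status()
```

One caveat: force merging millions of vectors rebuilds the HNSW graphs for the merged segments and can take a while, so it's best done once after the bulk load rather than on an index that's constantly being updated.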
Finally, we currently recommend having enough RAM to hold all of the vector data. In your case this is 3 million docs * 384 dimensions * 4 bytes per element = ~4.6 GB. It seems like you're right at the limit, and to accommodate more you may need to add more RAM (by adding additional servers, or by scaling up the single machine).
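As a rough rule of thumb (this counts only the raw vector data at 4 bytes per float dimension, not other index structures), the estimate looks like:

```python
def vector_data_gb(num_docs: int, dims: int, bytes_per_dim: int = 4) -> float:
    """Rough size of the raw vector data in GB for a float dense_vector field."""
    return num_docs * dims * bytes_per_dim / 1e9

print(vector_data_gb(3_000_000, 384))   # ~4.6 GB for the current 3M docs
print(vector_data_gb(80_000_000, 384))  # ~123 GB at the planned 80M docs
```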
Thanks. This is really helpful.
In the end I’ll probably have around 40M items indexed for kNN. It sounds like it won’t be possible to keep them all in memory without building out a more significant cluster.
In terms of refreshing the index, what happens if you keep adding items? You would do a bulk index of millions of items and refresh it, but then more data keeps coming into the index.
It's most important to call "refresh" after you've done the large bulk index. This is because there can be many docs buffered, making it expensive to prepare the index for search (what "refresh" does). If a smaller number of docs are buffered, the "refresh" on the initial search call is not too expensive.
This behavior is a bit 'trappy' and we're working to fix it within Lucene, so it won't be necessary to think so carefully about when to call "refresh". Here's the issue for reference: [LUCENE-10592] Should we build HNSW graph on the fly during indexing - ASF JIRA.
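In the meantime, the flow could look roughly like this (a sketch with placeholder index/host names and document shape): keep bulk indexing in batches, and issue one explicit refresh after the heavy load finishes rather than relying on the first search to trigger it.

```python
import json

import requests

ES = "http://localhost:9200"
INDEX = "documents"  # placeholder index name

def bulk_index(docs):
    """Index one batch of docs via the _bulk API (NDJSON body)."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": INDEX}}))
        lines.append(json.dumps(doc))
    resp = requests.post(
        f"{ES}/_bulk",
        data="\n".join(lines) + "\n",
        headers={"Content-Type": "application/x-ndjson"},
    )
    resp.raise_for_status()

# ... call bulk_index() for each batch of the large load ...

# One explicit refresh after the big load, so the first kNN search doesn't
# have to wait for a huge indexing buffer to be flushed.
requests.post(f"{ES}/{INDEX}/_refresh").raise_for_status()
```

Smaller trickles of new documents after that are fine; the refresh triggered by the next search will be cheap.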
Thanks, @Julie_Tibshirani.
Re: keeping HNSW in memory, is there a way for Elasticsearch to keep a subset of the HNSW graph in memory? It seems like with a large dataset, it won't be possible to keep the whole data structure in memory, even with the generally recommended max of 32GB per node. But it could potentially help if parts of the data structure (e.g. the first couple of hierarchy levels) were kept in memory the whole time.
EDIT: To give some numbers, if you have 10M documents on a node, that would require 14.3 GB of heap to keep the HNSW graph in memory the whole time; 20M documents would require 28.6 GB, making that about the max number of docs you could store on a node (with a vector size of 384 dimensions) while keeping the graph fully in memory. I would expect that one would want to store more than 20M documents per node, particularly with replication. It might be beneficial to keep only a subset of the graph in memory rather than the whole thing, or otherwise compress the data structure to use less RAM.
This is a great point, and we're indeed thinking about strategies to avoid having to recommend so much memory. In addition to HNSW improvements, we've considered adding support for algorithms that are designed to work well when most data is on SSDs, such as DiskANN (GitHub - microsoft/DiskANN: Scalable graph based indices for approximate nearest neighbor search) or SPANN ([2111.08566] SPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Search).
We first need to do more testing on how HNSW performs when not all vector data can fit in memory, to understand just how much worse the latency is. If you happen to try this out, it'd be interesting to hear about your experience.
@Julie_Tibshirani that’s great that you want to implement different algorithms. I was just reading the SPANN paper the other day and it looks really, really cool.
For reference, I have about 50M documents to index in total (I used to have 80M, but I reduced the scope of what I'm trying to do), and I have embeddings for probably 35M of them. (I skip embeddings for documents whose text and other metadata are too limited to get something reliable.) The vector index isn't going to fit entirely in memory even if I build out a larger cluster. What kinds of data / benchmarks would be most useful as you consider the options for different ways of doing ANN?
It'd just be interesting to hear about the kNN latency you're seeing (for a given amount of RAM, heap, and CPU, in addition to data scale and number of vector dimensions). Only if it's convenient for you, though — we're running our own benchmarks too, of course!
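Something as simple as timing a handful of _knn_search calls and reporting the median would be plenty; here's a rough sketch (index/field names and query parameters are placeholders):

```python
import random
import statistics
import time

import requests

ES = "http://localhost:9200"
INDEX = "documents"   # placeholder index name
FIELD = "embedding"   # placeholder dense_vector field
DIMS = 384

def median_knn_latency_ms(n_queries: int = 50) -> float:
    """Median end-to-end latency of _knn_search over random query vectors."""
    timings = []
    for _ in range(n_queries):
        vector = [random.random() for _ in range(DIMS)]  # stand-in for real embeddings
        body = {
            "knn": {
                "field": FIELD,
                "query_vector": vector,
                "k": 10,
                "num_candidates": 100,
            }
        }
        start = time.perf_counter()
        requests.get(f"{ES}/{INDEX}/_knn_search", json=body).raise_for_status()
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)

print(f"median kNN latency: {median_knn_latency_ms():.1f} ms")
```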