We have dedicated ES clusters we use for our vector database and approximate kNN search (a single int8_hnsw dense vector) as part of our hybrid-search solution.
We've started to push one cluster pretty hard, and we've noticed that if we scale out by a couple more nodes, performance can actually get slightly worse in our perf tests under normal load.
My theory is that we're probably not getting the best out of caching with more nodes. I'm aware that vector search relies on the underlying Linux page cache, and we've already made sure we've got plenty of RAM available on the nodes, so we've started looking into pre-loading the index files as described here.
But we don't currently have any .vex or .veq files in our data folder.
What we do have are:
.si
.cfe / .cfs
.dvd / .dvm
.fnm
So my questions are:
Which of these would we be best to pre-load?
Am I right in saying our vectors are probably stored in the .cfe or .cfs files (which my Google-fu tells me are Lucene compound files)?
If so, what triggers Lucene to write compound files? Is that the default now? Is it possible for the index to end up with either .vex/.veq or .cfe/.cfs? I.e. should we configure pre-load for all of those file extensions?
Compound files are normally used when the segment size is less than 1GB. Is that your case?
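One quick way to check segment sizes (and whether they're stored as compound files) is the cat segments API - the index name here is just a placeholder:

```
GET _cat/segments/my-knn-index?v&h=index,shard,segment,docs.count,size,compound
```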
You could preload compound files, but that's probably too much - even more so with quantized vectors, since you won't need to prewarm the non-quantized vector values, for example.
Yes, our segments will be pretty small. We have a trickle of indexing events throughout the day as products come in and out of stock, so there are lots of small segments getting written and then merged by the background merge process...
Even when ES merges segments, we only have ~150K documents max, and because this is a dedicated kNN cluster, those documents are super basic (ID + vector embedding), so the merged segments are probably still pretty small - I have definitely seen .v* files in there in the past though!
So would I be correct in saying that theoretically when a segment merge happens if the merged segment is > 1GB, Lucene might decide to write it out as .vex + .veq instead?
If so, I think we'll do a quick perf test with pre-loading ['.cfe', '.cfs', '.vex', '.veq'] and see if it makes a difference to us. From what you say, I suspect possibly not...
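For reference, this is roughly what I'm planning to test with. index.store.preload is a static index setting, so it'd have to go in at index creation (or via a template); the index name is just a placeholder, and I believe the extensions are listed without the leading dot:

```
PUT /knn-products
{
  "settings": {
    "index.store.preload": ["cfs", "cfe", "vex", "veq"]
  }
}
```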
Even when ES merges segments, we only have ~150K documents max
At that volume, you might want to experiment with doing exact kNN search via script_score. You'll get better search results via exact kNN, but you'll need to check that the latency is appropriate for your use case.
See this blog post for more details on approximate vs exact knn.
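Something like this is the rough shape of an exact kNN query via script_score - the index name, field name, and query vector here are just placeholders:

```
GET /knn-products/_search
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
        "params": { "query_vector": [0.12, -0.34, 0.56] }
      }
    }
  }
}
```

You can swap the match_all for a filter query if you only want to score a subset of documents.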
So would I be correct in saying that theoretically when a segment merge happens if the merged segment is > 1GB, Lucene might decide to write it out as .vex + .veq instead?
That is correct - the default merge policy for Elasticsearch uses 1GB as the limit for using compound files.
If so, I think we'll do a quick perf test with pre-loading ['.cfe', '.cfs', '.vex', '.veq'] and see if it makes a difference to us. From what you say, I suspect possibly not...
Keep in mind that you're searching over many small segments - that's going to make search slower, as kNN needs to go over every segment to get results. You should try to end up with fewer segments on the kNN search side - adjusting the merge policy or running periodic force_merge operations could be used for that.
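For example, something along these lines (index name is a placeholder):

```
POST /knn-products/_forcemerge?max_num_segments=1
```

Just be aware that force-merging down to a single segment is best done when the index isn't being actively written to - for a trickle of updates like yours, a periodic job after indexing quiets down would be the usual approach.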