KNN search speed


Today we are using ES mainly as a key / value store where most of our reads are just get by key.
We have recently started to use KNN, where we have:

  • Around 10MM docs.
  • 384 dim vectors.
  • Using cosine sim as the metric.
  • Documents are large (webpage), but we retrieve just the URL & vector.

The performance is really bad, but only on the first query. The first query can take around 20 seconds to return results; a second query (made within a few seconds of the first) takes about 10 seconds, and if we keep making KNN queries rapidly the time drops to sub-second.

If we stop making such queries for a few minutes the next query is once again 20 seconds or so.

I would love to understand this behaviour to see if there is something which can be done to "warm up" this query type.
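For context, the query pattern looks roughly like the following (the index and field names here are made up for illustration, and a real request would carry the full 384-dimensional query vector, truncated here to three values):

```json
POST /pages/_search
{
  "knn": {
    "field": "page_vector",
    "query_vector": [0.12, -0.07, 0.31],
    "k": 10,
    "num_candidates": 100
  },
  "_source": ["url"]
}
```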


@dendog1 have you read through: Tune approximate kNN search | Elasticsearch Guide [8.6] | Elastic


Some observations:

  • Being slow and then fast indicates to me that the vector index was not in memory and was then paged in.
  • Becoming slow again shows that it is being kicked back out of memory. Usually this indicates that you don't have enough RAM to hold the vector index alongside the other structures you are using in the index.

So, is this index being used for other things? Are documents continually being added?


Hey @BenTrent thank you so much for getting back to me!!

So yes, I have read about tuning the search - those points are all noted, and we have applied the options available to us.

On your observations, yes so this is the issue that I would like to understand - how to keep this index in memory.

This index at the moment:

  • 98% of the work is get by ID; we are using ES as a Redis-like cache here.
  • 1.9% is inserts of new documents into this index.
  • 0.1% are the occasional KNN search queries which users run on this index, and where we really need good performance!

Is there any way I can force ES to keep this infrequent operation "hot"?

Thanks in advance!

One finding I have is that sharding the index seems to be detrimental to speed, which seems rather odd, as I thought sharding would allow parallel searches...

It depends.

If the different shards still have 40+ segments, it wouldn't help much. If your different shards had much fewer segments, then I would expect some improvement.
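If reducing segment count is the goal, one option is a force merge - each segment carries its own HNSW graph, so fewer segments means fewer graphs to search. A sketch, assuming an index named `pages`; note that force-merging down to a single segment is generally only advisable for indices that are no longer receiving heavy writes:

```json
POST /pages/_forcemerge?max_num_segments=1
```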

Thanks @BenTrent would you please be able to give some feedback about how one can force ES to keep this index and operation hot?

This index is accessed all the time, but the operations are very different.


I will have to dig a bit more into how to keep kNN vector files in memory preferentially over others.

Something to check:

It would be good to know how your vector index size compares to your server's RAM.


One thing to try is: Preloading data into the file system cache | Elasticsearch Guide [8.6] | Elastic

The vector index file extensions are: vem, vex, and vec.
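A sketch of that preload setting, assuming an index named `pages`. Since `index.store.preload` is a static setting, on an existing index it has to be applied while the index is closed:

```json
PUT /pages/_settings
{
  "index.store.preload": ["vec", "vex", "vem"]
}
```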

@BenTrent again huge thank you for your replies, I ran the disk usage check:
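(For anyone following along, the stats below come from the analyze index disk usage API; the index name is a placeholder:)

```json
POST /pages/_disk_usage?run_expensive_tasks=true
```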

    {
      "store_size": "103.9gb",
      "store_size_in_bytes": 111633278466,
      "all_fields": {
        "total": "103.8gb",
        "total_in_bytes": 111525920926,
        "inverted_index": {
          "total": "10.9gb",
          "total_in_bytes": 11704910019
        },
        "stored_fields": "78.3gb",
        "stored_fields_in_bytes": 84076073351,
        "doc_values": "2.5gb",
        "doc_values_in_bytes": 2709327298,
        "points": "293.3mb",
        "points_in_bytes": 307648946,
        "norms": "41mb",
        "norms_in_bytes": 43087459,
        "term_vectors": "0b",
        "term_vectors_in_bytes": 0,
        "knn_vectors": "11.8gb",
        "knn_vectors_in_bytes": 12684873853
      }
    }

My machine has 8 GB of RAM, and the kNN vectors are 11.8 GB - does this mean we are already at the optimal performance for the current index size vs. cluster size?

Also, the index does not even fit into our RAM - is that an issue too?

I had a look at the link for preloading data into the file system cache. I can do this, but there is a warning there about size - do you think it is still wise given the above? And finally on this point: I thought this would only make a difference for the first few requests, and after that ES would load the files into cache automatically - is that not correct?

For kNN to work optimally, the entire graph and vectors need to be in memory.

So, that means you need at least around 12gb of ram (not including the ram used by the JVM).
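As a rough sanity check on that figure, the raw float32 vectors alone can be estimated from the numbers earlier in the thread (10MM docs, 384 dims). The measured `knn_vectors` size of 11.8 GB is the authoritative number; this is just the back-of-envelope formula, and it excludes the HNSW graph overhead:

```python
# Rough estimate of the off-heap memory needed for the raw float32 vectors.
# This ignores the HNSW graph itself, which adds further overhead.
num_docs = 10_000_000
dims = 384
bytes_per_dim = 4  # float32

vector_bytes = num_docs * dims * bytes_per_dim
print(f"raw vectors: {vector_bytes / 1024**3:.1f} GiB")  # ~14.3 GiB
```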

In 8.5 we added support for 'byte' encoded vectors. So you can quantize your vectors to int8 to make them much smaller, and you may be able to run just fine within your current hardware constraints.
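A sketch of such a mapping, with hypothetical index and field names; the vectors themselves would need to be quantized client-side to signed int8 values (-128 to 127) before indexing:

```json
PUT /pages-quantized
{
  "mappings": {
    "properties": {
      "page_vector": {
        "type": "dense_vector",
        "element_type": "byte",
        "dims": 384,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}
```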

Thanks @BenTrent that makes sense, and thank you for letting me know about the quantized vectors!

One more question: would it make sense for me to create another index containing a sample of the full index - just the vectors and some minor metadata? If I make such an index, which would be much smaller, would I be able to force ES to hold it in memory even if it is not frequently accessed?

Thank you!

@dendog1 this would only happen if those indices were on different nodes. A node only has so much off-heap memory and all shards on that node must share it.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.