Help optimizing slow knn query

Hi,

I'm running into a performance problem with hybrid search (keyword + vector). As far as I can tell, the bottleneck is the vector part, for which I use the knn query.

Below, I'll share:

  • The query I'm performing
  • The indices settings, mappings, and size
  • My cluster resources
  • Some profiling insights from the query
  • Things I tried to optimize it

I'm hoping someone can give me some ideas/direction on what I'm missing and whether Elastic is the right tool for my needs.

This is a sample of how my query looks:

GET index1,index2,index3,index4,index5,index6/_search
{
	"size": 10,
	"_source": ["includes": ["_id"]],
	"retriever":
	{
		"rrf":
		{
			"retrievers":	[
			{						
				"standard":
				{
					"query":
					{
						"bool":
						{
							"minimum_should_match":1,
							"should":
							[
								{
									"bool":
									{
										"filter":
										{
											"terms":
											{
												"_index":["index1","index2","index3","index4","index5"]
											}
										},
										"must":
										{
											"knn":
											{
												"field":"field1_vector",
												"num_candidates":10,
												"similarity":0.5,
												"query_vector": [some_vector_of_float_numbers]
											}
										}
									}
								},
								{
									"bool":
									{
										"filter":
										{
											"terms":
											{
												"_index":["index1","index2"]
											}
										},
										"must":
										{
											"knn":
											{
												"field":"field2_vector",
												"num_candidates":10,
												"similarity":0.5,
												"query_vector": [some_vector_of_float_numbers]
											}
										}
									}
								}
								...17 more should expressions like this
							]
						}
					}
				}
			},
			{
					"standard":
					{
						"query":
						{
							"simple_query_string":
							{
								"fields":["field1", "field2", ...17 more],
								"query":"some_full_text_query"
							}
						}
					}
				}
			]
		}
	}
}

These are the details of the indices:

  • I have 6 indices that I want to search across.
  • The two biggest indices are around 60GB each.
  • In total they hold around 2.1 million documents with many text fields, but they don't share exactly the same schema.
  • Some indices are 10x bigger than others.
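
(For reference, the size and doc-count figures above are the kind of thing the cat indices API reports; a hypothetical check using the same index names as in the query:)

GET _cat/indices/index1,index2,index3,index4,index5,index6?v&h=index,docs.count,store.size&s=store.size:desc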

Example of index settings and mappings

{
  "index1": {
    "mappings": {
      "dynamic": "true",
      "dynamic_date_formats": [
        "strict_date_optional_time",
        "yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z"
      ],
      "dynamic_templates": [],
      "date_detection": true,
      "numeric_detection": false,
      "properties": {
        "field1": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "field1_vector": {
          "type": "dense_vector",
          "dims": 384,
          "index": true,
          "similarity": "dot_product",
          "index_options": {
            "type": "int8_hnsw",
            "m": 16,
            "ef_construction": 100
          }
        },
		"field2": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "field2_vector": {
          "type": "dense_vector",
          "dims": 384,
          "index": true,
          "similarity": "dot_product",
          "index_options": {
            "type": "int8_hnsw",
            "m": 16,
            "ef_construction": 100
          }
        }
		...other fields with exactly the same config
      }
    },
    "settings": {
      "index": {
        "lifecycle": {
          "name": "hot_to_warm_immediate"
        },
        "routing": {
          "allocation": {
            "include": {
              "_tier_preference": "data_warm,data_hot"
            },
            "require": {
              "_tier_preference": "data_warm"
            }
          }
        },
        "number_of_shards": "1",
        "provided_name": "index1",
        "analysis": {
          "filter": {
            "english_stemmer": {
              "type": "stemmer",
              "language": "english"
            },
            "english_stop": {
              "ignore_case": "true",
              "type": "stop",
              "stopwords": "_english_"
            },
            "english_possessive_stemmer": {
              "type": "stemmer",
              "language": "possessive_english"
            }
          },
          "analyzer": {
            "default": {
              "filter": [
                "lowercase",
                "asciifolding",
                "english_possessive_stemmer",
                "english_stemmer",
                "english_stop"
              ],
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        },
        "priority": "50",
        "number_of_replicas": "1"
      }
    }
  }
}

Cluster resources
As you can see from the index settings, I'm not using the hot tier at all; everything moves to the warm tier almost immediately.
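
In ILM terms, the hot_to_warm_immediate policy essentially just moves indices to the warm tier right away; a simplified sketch of that shape (not the exact policy):

PUT _ilm/policy/hot_to_warm_immediate
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {}
      },
      "warm": {
        "min_age": "0ms",
        "actions": {}
      }
    }
  }
}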

I'm using ES cloud with 2.78 TB storage | 15 GB RAM | 1.9 vCPU, 2 zones

Profiling insights
As far as I can tell, most of the time goes to vector comparisons; at least that's my reading of the rewrite_time metrics. I'm attaching a raw profiling session from a call that took almost 20 seconds.

Here's the JSON from the profiling -> https://file.io/teMSFpWyXBdG
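
One way to confirm it's the vector side is to profile a single knn search in isolation, outside the rrf retriever; a sketch, reusing field1_vector and the same query-vector placeholder as above:

GET index1/_search
{
  "profile": true,
  "size": 10,
  "knn": {
    "field": "field1_vector",
    "query_vector": [some_vector_of_float_numbers],
    "k": 10,
    "num_candidates": 10
  }
}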

What I tried

  • Scaling up the cluster - the options with 60GB RAM give relatively acceptable response times, but that feels like far too much hardware for 2 million docs.
  • Reducing the number of segments to 5 for the biggest indices (see the force-merge sketch after this list) - it didn't do much in terms of performance.
  • Making sure the returned _source is as minimal as possible - the _id alone works for my use case, which is what you can see in the shared query. This gave a ~3x performance gain.
  • Further reducing the number of knn queries I perform might be an option; currently there are around 19 of them.
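
For reference, reducing the segment count as in the second bullet is done with the force merge API, roughly:

POST index1,index2/_forcemerge?max_num_segments=5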

Final goal
I'll need to run this at scale, with potentially thousands of searches per minute.

What else can I do to improve the performance and is Elastic even the right tool for the job?

In the mapping, I'd also exclude field1_vector, field2_vector from the _source.

Something like:

{
  "mappings": {
    "_source": {
          "excludes": ["field1_vector", "field2_vector"]
      },
    "properties": {
      "field1_vector": {
        "type": "dense_vector"
      },
      "field2_vector": {
        "type": "dense_vector"
      }
    }
  }
}

In case it helps.

ES cloud with 2.78 TB storage | 15 GB RAM | 1.9 vCPU, 2 zones

Are you using this? Specifically, which hardware profile?

What else can I do to improve the performance

I'll let some other experts add their thoughts :wink:

is Elastic even the right tool for the job?

I hope so!

How many vector fields are there?

The MINIMUM amount of memory required for each field will be:

num_vectors * (384 + 20) bytes

So, for just one vector field and 2 million docs, that's about 808MB.

I am not sure what your cloud profile is. The vector-optimized profile gives a little more off-heap, but otherwise you get a little less than half of the RAM as off-heap (which is what vectors need). So maybe you are at around 13GB of off-heap in total (two 15GB nodes).

I also see you have 1 replica, so double any RAM requirements I mentioned above (2M vectors would be about 1.6GB for a single field).


@dadoonet

Ah sorry, forgot to add it to my sample query. I'm specifically excluding all vector fields, yes.

Yep, I'm using Vector search optimized (ARM)

@BenTrent

Each index has a different number of vector fields; for the bigger ones, I have 10 vector fields in one and 6 in the other. The rest of the indices are probably inconsequential, as they have around 20k docs on average, whereas the two indices I mentioned have around 1 million each. So if I'm not mistaken that would add up to:
4GB + 2.4GB + roughly 1GB for all the rest = ~7.4GB of required RAM.
If I double that for the replica, it probably goes a little over. Would that be enough of an issue to cause 20-second response times?

Ah, I just noticed you've put the exclude in the mappings of the index itself. Is that different from excluding the fields at query time?

Yes, it should be faster the way I mentioned, AFAIK.
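
(For comparison, the query-time exclusion being discussed would look roughly like this; just a sketch with a wildcard over the vector field names:)

GET index1/_search
{
  "size": 10,
  "_source": {
    "excludes": ["*_vector"]
  },
  "query": {
    "match_all": {}
  }
}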

2.78 TB storage | 15 GB RAM | 1.9 vCPU, 2 zones

Is this profile PER node or the total resources over 2 zones?

This isn't the vector-optimized profile. I am not sure which profile that is; I am assuming dense storage, as the disk-to-RAM/vCPU ratio is very high.

Deploying my own single node with 15GB | 1.9 vCPU and calling _nodes, I see "-Des.total_memory_bytes=16101933056".

Then for the JVM allocation: "heap_max_in_bytes": 8053063680.

So the single node has about 7.49GB of off-heap.
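
(Both numbers come from the nodes info API; something like this shows them, with filter_path only trimming the response:)

GET _nodes/jvm?filter_path=nodes.*.jvm.mem.heap_max_in_bytes,nodes.*.jvm.input_arguments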

Since your data set has replicas, it requires about 15GB for vectors alone. But you are also running other kinds of queries. You are right at the edge of what is required just for the vectors, with no leeway at all for the other queries, which also need some memory to run (term postings, etc.).

I suggest:

  • Going up a level in node size (you don't need to go to the max level of 60GB)
  • Increasing the quantization to int4_hnsw (see the mapping sketch below)
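
For the quantization change, that's just the index_options type on the dense_vector fields; a sketch based on the field definitions above (index1_int4 is a made-up name, and depending on your version this likely means creating a new index and reindexing):

PUT index1_int4
{
  "mappings": {
    "properties": {
      "field1_vector": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "dot_product",
        "index_options": {
          "type": "int4_hnsw",
          "m": 16,
          "ef_construction": 100
        }
      }
    }
  }
}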

Your latency would also improve with more vCPUs.


Thanks guys!

I ended up reducing the vector count, which allowed me to fit within the ~7.5GB limit of the 15GB subscription plan. I must say the plan presentation is rather confusing: it says 15GB next to each node, so I was under the impression it was 15GB per node, not 15GB in total for all nodes.

Anyway, maybe one last question, or rather a confirmation of a suspicion I have.
If I have many indices that are pretty active, meaning a lot of search and indexing requests are performed on them, with some of those requests happening simultaneously, then I imagine the vectors for each active index will need to be held in memory, and because of this the required memory will add up pretty quickly. Is that correct?