How to increase the throughput for kNN search in my ES cluster

Hello!

I created a track in ESrally with some queries that are normally run against my ES cluster, and I replicated the big index (using a snapshot) from my original cluster into a new one. I am running the ESrally test from another pod in the cluster against this testing cluster, as I want to get to 100 QPS. Right now I am at 40 QPS, and even when I increase the replicas or the data nodes it stays the same.

My ESrally track is just about increasing the QPS while keeping latency under 1 s; for that I simply increase the number of clients gradually.

The testing cluster has 512 GB of SSD, 64 GB of RAM and 30 CPUs per node. I have everything isolated and it only holds this one big index with 35 shards. The index has 1.8 TB of data and 12 million docs. I have 5 master nodes and 5 data nodes; I have increased the data nodes from 5 to 10 and then to 15. Going to 15 nodes with more replicas has been the only way to get to 60 QPS.

Is there any recommendation, or must I simply keep adding more nodes and replicas?

Also, would it be better to partition the index?

Thanks in advance for any guidance!

Can you share your Rally track? Or at minimum share your queries and kNN index settings. My gut reaction is there are probably some knobs you can fiddle with that might help.

Also, I’m curious: are you able to hit 100 QPS on your own ES cluster and trying to tune your Rally track to match it, or are you just looking to boost the performance of your ES cluster and using Rally and a separate cluster to iterate towards that?

To answer your questions: I’m not sure increasing the nodes and replicas will help, particularly if you have already sized those up a good bit and seen very little improvement.

Partitioning may help, particularly if you have natural partitions in the data and can rolling-delete or archive indices; that will greatly reduce the overall search space. But if you want to query across those partitions, then this likely won’t help you much.
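As a rough illustration of what I mean (the index names, endpoint and document contents here are made up, assuming time-based partitions you can roll over and delete):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # illustrative endpoint

# Write side: send each document to the partition it belongs to, e.g. a monthly index.
es.index(index="data-2025-11", document={"chunk": "...", "date_created": "2025-11-13"})

# Read side: query only the partitions you actually need, so the search
# touches far fewer shards than one monolithic index would.
resp = es.search(index="data-2025-10,data-2025-11", query={"match_all": {}}, size=10)

# Rolling retention: dropping an old partition is a cheap index delete.
es.indices.delete(index="data-2025-01", ignore_unavailable=True)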

Seeing the query and knn configuration will probably lead to better answers for your questions though.

First of all, thanks for the quick response!

In my own ES cluster I am getting about 16 QPS in production with good results. I want to migrate to a better setup that can do 100 QPS, and I am testing with ESrally on a new cluster to see how I can get there. My current one only has 5 nodes and 5 gpus per data node, without replicas. The test one has the configuration mentioned above.

Here is my track:

{
    "short-description": "Max QPS load test",
    "description": "Progressive load test to find maximum sustainable throughput",
    "indices": [
        {
            "name": "data-v0.0.7_2025_11_06",
            "auto-managed": false
        }
    ],
    "operations": [
        {
            "name": "hybrid-search",
            "operation-type": "search",
            "param-source": "query-file-source",
            "index": "data-v0.0.7_2025_11_06",
            "cache": false
        }
    ],
    "schedule": [
        {
            "name": "hybrid-search-1-client",
            "operation": "hybrid-search",
            "warmup-time-period": 10,
            "time-period": 60,
            "clients": 1
        },
        {
            "name": "hybrid-search-2-clients",
            "operation": "hybrid-search",
            "warmup-time-period": 10,
            "time-period": 60,
            "clients": 2
        },
        {
            "name": "hybrid-search-4-clients",
            "operation": "hybrid-search",
            "warmup-time-period": 10,
            "time-period": 60,
            "clients": 4
        },
        {
            "name": "hybrid-search-8-clients",
            "operation": "hybrid-search",
            "warmup-time-period": 10,
            "time-period": 60,
            "clients": 8
        },
        {
            "name": "hybrid-search-16-clients",
            "operation": "hybrid-search",
            "warmup-time-period": 10,
            "time-period": 60,
            "clients": 16
        },
        {
            "name": "hybrid-search-32-clients",
            "operation": "hybrid-search",
            "warmup-time-period": 10,
            "time-period": 60,
            "clients": 32
        },
        {
            "name": "hybrid-search-64-clients",
            "operation": "hybrid-search",
            "warmup-time-period": 10,
            "time-period": 60,
            "clients": 64
        },
        {
            "name": "hybrid-search-128-clients",
            "operation": "hybrid-search",
            "warmup-time-period": 10,
            "time-period": 60,
            "clients": 128
        }
    ]
}

An example query (with the 4096-dimension query vector elided) is this one:

{"body": {
  "size": 100,
  "explain": false,
  "_source": {
    "includes": ["chunk_id", "document_id", "datastore_id", "date_created", "document_name", "chunk", "metadata.chunk_size", "metadata.file_format", "metadata.page", "metadata.coordinates", "metadata.document_date_humanized", "metadata.extras.custom_metadata", "metadata.custom_metadata_config", "metadata.extras.section_id", "metadata.section_id", "metadata.extras.document_title", "metadata.extras.section_title", "metadata.extras.is_figure", "metadata.extras.file_name", "metadata.link", "metadata.extras.link", "metadata.type", "metadata.extras.type", "metadata.extras.next_page_chunk_locations"],
    "excludes": []
  },
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "metadata",
            "query": {
              "knn": {
                "field": "metadata.extras.embeddings_model_1",
                "query_vector": [0.013909942, ...],
                "num_candidates": 100
              }
            },
            "score_mode": "max",
            "boost": 1.0
          }
        }
      ],
      "filter": [
        {"term": {"store_id": "9784afae-4af4-44c2-a5d7-c24f51728b2c"}}
      ],
      "boost": 1.0
    }
  }
}}

and the track.py is:

import json
import os


def register(registry):
    registry.register_param_source("query-file-source", QueryFileParamSource)


class QueryFileParamSource:
    """Cycles through newline-delimited query bodies read from queries.json."""

    def __init__(self, track, params, **kwargs):
        self._index = params.get("index", "data-v0.0.7_2025_11_06")
        queries_file = os.path.join(track.root, "queries.json")
        self._queries = []
        with open(queries_file, "r") as f:
            for line in f:
                line = line.strip()
                if line:
                    self._queries.append(json.loads(line))
        self._index_pos = 0
        self._cache = params.get("cache", False)

    def partition(self, partition_index, total_partitions):
        # All clients share the same source; each client simply cycles the list.
        return self

    def params(self):
        query = self._queries[self._index_pos % len(self._queries)]
        self._index_pos += 1
        return {"index": self._index, "cache": self._cache, "body": query["body"]}

and my index config, on an 8.19 ES cluster, is:

"settings" : {

      "index" : {

        "routing" : {

          "allocation" : {

            "include" : {

              "_tier_preference" : "data_content"

            }

          }

        },

        "refresh_interval" : null,

        "number_of_shards" : "35",

        "provided_name" : "data-2025_11_06",

        "creation_date" : "1762919828038",

        "analysis" : {

          "filter" : {

            "synonym_filter" : {

              "format" : "wordnet",

              "updateable" : "true",

              "type" : "synonym",

              "synonyms_path" : "/usr/share/elasticsearch/config/synonyms.txt",

              "lenient" : "true"

            },

            "custom_synonym_filter_ffd7095b-7ceb-47a8-80d9-fa35cf03627e" : {

              "type" : "synonym_graph",

              "updateable" : "true",

              "synonyms_set" : "custom_synonym_set_ffd7095b-7ceb-47a8-80d9-fa35cf03627e"

            },

            "shingles_filter" : {

              "max_shingle_size" : "4",

              "min_shingle_size" : "2",

              "output_unigrams" : "true",

              "type" : "shingle"

            },

            "possessive_english_filter" : {

              "name" : "possessive_english",

              "type" : "stemmer"

            },

            "stopwords_english_filter" : {

              "type" : "stop",

              "stopwords" : "_english_"

            },

            "custom_synonym_filter_d26b3403-f35a-48e1-a7ce-ed1f1352149d" : {

              "type" : "synonym_graph",

              "updateable" : "true",

              "synonyms_set" : "custom_synonym_set_d26b3403-f35a-48e1-a7ce-ed1f1352149d"

            },

            "custom_synonym_filter_08d04d9e-8933-4784-b17a-ce8accd5a5c3" : {

              "type" : "synonym_graph",

              "updateable" : "true",

              "synonyms_set" : "custom_synonym_set_08d04d9e-8933-4784-b17a-ce8accd5a5c3"

            }

          },

          "analyzer" : {

            "custom_analyzer" : {

              "filter" : [

                "lowercase",

                "stopwords_english_filter",

                "possessive_english_filter"

              ],

              "type" : "custom",

              "tokenizer" : "standard"

            },

            "custom_synonyms_analyzer_d26b3403-f35a-48e1-a7ce-ed1f1352149d" : {

              "filter" : [

                "lowercase",

                "custom_synonym_filter_d26b3403-f35a-48e1-a7ce-ed1f1352149d",

                "possessive_english_filter",

                "porter_stem"

              ],

              "type" : "custom",

              "tokenizer" : "standard"

            },

            "custom_search_analyzer" : {

              "filter" : [

                "lowercase",

                "stopwords_english_filter",

                "possessive_english_filter",

                "synonym_filter"

              ],

              "type" : "custom",

              "tokenizer" : "standard"

            },

            "custom_synonyms_analyzer_ffd7095b-7ceb-47a8-80d9-fa35cf03627e" : {

              "filter" : [

                "lowercase",

                "custom_synonym_filter_ffd7095b-7ceb-47a8-80d9-fa35cf03627e",

                "possessive_english_filter",

                "porter_stem"

              ],

              "type" : "custom",

              "tokenizer" : "standard"

            },

            "edge_ngram_analyzer" : {

              "filter" : [

                "lowercase",

                "possessive_english_filter"

              ],

              "type" : "custom",

              "tokenizer" : "edge_ngram_tokenizer"

            },

            "custom_analyzer_with_shingles" : {

              "filter" : [

                "lowercase",

                "stopwords_english_filter",

                "possessive_english_filter",

                "shingle"

              ],

              "type" : "custom",

              "tokenizer" : "standard"

            },

            "custom_porter_stem_analyzer" : {

              "filter" : [

                "lowercase",

                "stopwords_english_filter",

                "possessive_english_filter",

                "porter_stem"

              ],

              "type" : "custom",

              "tokenizer" : "standard"

            },

            "custom_synonyms_analyzer_08d04d9e-8933-4784-b17a-ce8accd5a5c3" : {

              "filter" : [

                "lowercase",

                "custom_synonym_filter_08d04d9e-8933-4784-b17a-ce8accd5a5c3",

                "possessive_english_filter",

                "porter_stem"

              ],

              "type" : "custom",

              "tokenizer" : "standard"

            }

          },

          "tokenizer" : {

            "edge_ngram_tokenizer" : {

              "token_chars" : [

                "letter",

                "digit"

              ],

              "min_gram" : "3",

              "type" : "edge_ngram",

              "max_gram" : "6"

            }

          }

        },

        "number_of_replicas" : "2",

        "uuid" : "JE5DpNYiQ5eFpirKNSWsIA",

        "version" : {

          "created" : "8503000"

        }

      }

    }

  }

}

I just finished a test with 20 nodes and got to 70 QPS, but that is too many resources. I am thinking about just having two different ES clusters and balancing between them myself, since each one can get to 40 QPS with even fewer vertical resources. I know that reducing the number of _source fields I retrieve helps with latency, so maybe I will work on that, but if there are quicker solutions it would be really helpful!

I would say that ensuring your queries and mappings are as efficient as possible is an important first step, as that can have a significant impact on query performance, especially if you are looking to support high levels of concurrent queries.

It looks like you are filtering on a specific term. Is this something all queries do? If so, what is the cardinality of this field? If it is high, it may make sense to use routing to ensure related documents are located on the same shard, so that only a single shard needs to be queried in order to serve a query.
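Roughly what I have in mind, as a sketch (the endpoint and document contents are illustrative, and note that routing only applies to documents indexed with it, so existing data would need reindexing):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # illustrative endpoint
index = "data-v0.0.7_2025_11_06"
store_id = "9784afae-4af4-44c2-a5d7-c24f51728b2c"

# Index time: all documents for one store_id land on the same shard.
es.index(index=index, routing=store_id, document={"store_id": store_id, "chunk": "..."})

# Query time: pass the same routing value so only that shard is searched
# instead of fanning the search out across all 35 shards.
resp = es.search(
    index=index,
    routing=store_id,
    query={"bool": {"filter": [{"term": {"store_id": store_id}}]}},
    size=10,
)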

I see that you are using nested documents, which tend to add overhead when querying. You have not shared your mappings or the full structure of the documents, so it is hard to tell whether this is required or whether there may be some gains available here.
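For illustration only, if the embedding does not actually need to live inside the nested metadata objects, a flatter field could look something like this (the dims come from your 4096-dimension vector; the index name and similarity are guesses, so adjust to your model):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # illustrative endpoint

es.indices.create(
    index="data-flat-example",  # hypothetical index for comparison tests
    mappings={
        "properties": {
            "store_id": {"type": "keyword"},
            "chunk": {"type": "text"},
            "embeddings_model_1": {
                "type": "dense_vector",
                "dims": 4096,
                "index": True,
                "similarity": "cosine"  # assumption; use whatever your model requires
            }
        }
    },
)

A top-level field like this can be queried with a knn query directly, without the nested wrapper.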

That is typically how you scale QPS, at least in theory. If N nodes can store all your data (without replicas) and serve X queries per second within your latency target, throughput should scale near-linearly as you add nodes and replicas: 2N nodes with 1 replica configured should be able to serve roughly 2X queries per second, 3N nodes with 2 replicas per shard roughly 3X, and so on.

This is why I recommend optimising queries and mappings initially as that may reduce the number and/or size of nodes in the initial cluster.

Sorry I wasn’t clear earlier. Can you share your mappings so we can see how you’ve set up your kNN fields? That may help us recommend how to optimize your kNN field configuration or queries.
