High Latency on KNN Search

Hi everyone,

I’m running an Elasticsearch cluster (version 8.17) with the basic license. The cluster has 3 nodes, each on Linux servers with 62 GB RAM and 1 TB disk space. I’ve set up an index for vector search using the following mapping and settings:

Mapping:

{
    "mappings": {
        "_source": {
            "excludes": [
                "vector"
            ]
        },
        "properties": {
            "vector": {
                "type": "dense_vector",
                "index": true,
                "index_options": {
                    "type": "int8_hnsw"
                },
                "dims": 256,
                "similarity": "cosine"
            }
        }
    },
    "settings": {
        "index": {
            "refresh_interval": "60s",
            "number_of_shards": "3",
            "number_of_replicas": "2"
        }
    }
}

I’ve indexed 25 million vectors into this index. After that, I updated the settings as follows:

"merge": {
    "policy": {
        "max_merged_segment": "20g"
    }
},
"store": {
    "preload": [
        "vex",
        "veq"
    ]
}

When I perform a KNN search for the first time, the request latency is around 2 minutes. Subsequent requests for the same query are reduced to around 10 seconds.

I tested a similar setup (same cluster and data) with Qdrant, and it works fine with much lower latency for the initial queries.

Why is the initial KNN search so slow compared to subsequent searches?
Are there additional optimizations I can apply to improve the latency?

@Ahmad2356 how much JVM heap are you giving the Elasticsearch process?

Additionally, adding 2 replicas will triple the amount of RAM required to search effectively, as you now have 3 copies of each shard to search, and all of them need to be in memory.

(256 + 4*16) * 2,000,000 = 640,000,000 bytes, which is only about half a gig. So, even with the replicas, I would expect this to be loaded into memory.
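For reference, that estimate can be reproduced with a quick calculation. A minimal sketch, assuming the per-vector cost of `dims + 4*m` bytes from the kNN tuning guidance (int8-quantized vector values plus HNSW graph links with the default `m` of 16); the vector counts are illustrative:

```python
# Rough off-heap memory estimate for an int8-quantized HNSW index.
# Per vector: dims bytes (quantized values) + 4*m bytes (HNSW graph links).
def knn_memory_bytes(num_vectors, dims, m=16):
    return num_vectors * (dims + 4 * m)

print(knn_memory_bytes(2_000_000, 256))   # ~0.6 GB, the estimate above
print(knn_memory_bytes(25_000_000, 256))  # ~8 GB per full copy of the index
```

With replicas, multiply by the number of shard copies, since primaries and replicas both serve searches.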

By "initial queries" are you meaning only from a cold start and then all subsequent queries are fast? Or only repeated queries are fast?

I’ve indexed 25 million vectors into this index. After that, I updated the settings as follows:

FYI, those updates only take effect on mutations of the index. If you want them to have an effect, you should index AFTER you set them.
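To make that concrete, here is a sketch of building the full settings body up front so it is in place before any documents (and therefore any segments) are created; the index name and values are illustrative, and the field names match the ones used earlier in this thread:

```python
import json

# Settings applied at index-creation time, before indexing starts,
# so every new segment is written with them in effect.
settings = {
    "index": {
        "refresh_interval": "60s",
        "number_of_shards": 3,
        "number_of_replicas": 1,
        "merge": {"policy": {"max_merged_segment": "20gb"}},
        "store": {"preload": ["vex", "veq", "vem"]},
    }
}

# Body for PUT /my-vectors (e.g. via the Python client's indices.create):
print(json.dumps({"settings": settings}, indent=2))
```

Note that `index.store.preload` is a static setting, so it has to be supplied at creation time (or on a closed index) anyway.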

Thank you so much for taking the time to reply to my question—I appreciate your input!

Regarding the JVM heap size, I haven’t explicitly set it, so it’s currently using the default value, which I believe is based on my system memory (62 GB per node).

I noticed that the formula (256 + 4*16) * 2,000,000 = 640,000,000 differs from the formula mentioned in the Elasticsearch documentation for tuning approximate kNN search: num_vectors * 4 * 16. Could you clarify what the 256 + 4*16 and 2,000,000 parts represent? Is it related to the dimensions or another memory-related consideration?

I tested your suggestion and created a new index regarding the initial query latency. This time, I set the preload setting first and then indexed the data. The response time for different queries is between 1 and 2 seconds. Specifically, the took value in the Elasticsearch response body is around 1000–2000 milliseconds.

Thanks again for your guidance.

In my search, I’ve noticed that many Elasticsearch users recommend first indexing the data, followed by a force merge to a single segment, and then performing the search to get accurate benchmarks. However, in my use case, I often need to index and search simultaneously. This is why I’m looking for optimizations that work well with both indexing and querying in parallel.

This is often recommended in order to get an accurate estimate of how different types of mappings and settings compare with respect to index size on disk. When it comes to querying, it used to be that each search against a single shard ran single-threaded, so force-merging down to a single segment could help improve response time as well as reduce memory usage. In recent versions of Elasticsearch I believe searches against different segments within a shard can be performed in parallel, so the old guideline to force-merge down to a single segment (which is not great if data is not immutable) may no longer be optimal.

Indexing will create new segments and require in-memory structures to be rebuilt, so it is likely to always impact searches to at least some extent. You need to test and find a balance that works for your use case.

adding 2 replicas will triple the amount of ram required to search effectively as how you have 3 copies of the given shard to search and needed to be in memory.

@BenTrent Just curious: why does it triple the amount of RAM required for search? Won't the search happen only on the primary copies? That would be true in the case of indexing, but for search, why do we need the replicas to be in memory?


No. Primaries and replicas all serve queries, so the data has to be loaded into memory for all shard copies.


Thank you for your detailed explanation! I greatly appreciate the insights you’ve shared. May I kindly ask for your opinion regarding the updated search strategy? To the best of my knowledge, this matter is still under discussion, as reflected in the following open issue: Elastic/elasticsearch#90700, which has been ongoing since 2022.

From my understanding, during k-NN searches, the HNSW graph for each segment must be queried, given that each segment retains its graph. Since segments are continually added during indexing, this could inherently lead to suboptimal search performance due to the necessity of traversing multiple graphs.

I would greatly value any additional references or insights you might have regarding improvements in parallel searches within segments.

The issue you linked to seems to have been at least partially implemented in Elasticsearch 8.12. This is an area under active development that is moving fast, but as I do not work for Elastic I have no insight into this and cannot provide any feedback.

Hey @Ahmad2356

Let me try to answer all these questions in-line as possible and give additional insights.

From my understanding, during k-NN searches, the HNSW graph for each segment must be queried, given that each segment retains its graph. Since segments are continually added during indexing, this could inherently lead to suboptimal search performance due to the necessity of traversing multiple graphs.

This is true, but we also do the following:

  • We share information during exploration between segments to aid in early termination of the queries.
  • We query the segments in parallel, this is dependent on the number of CPUs on the server.
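As context for how much graph exploration happens per shard, this is controlled by the `num_candidates` parameter of the kNN request. A minimal search body sketch (the field name matches the mapping in this thread; the query vector and sizes are illustrative):

```python
import json

# Sketch of a kNN search body (POST /<index>/_search).
# `num_candidates` is how many candidates are gathered per shard, explored
# across its segment graphs, before the top `k` hits are merged.
# Larger values trade latency for recall.
knn_query = {
    "knn": {
        "field": "vector",
        "query_vector": [0.1] * 256,  # illustrative 256-dim query
        "k": 10,
        "num_candidates": 100,
    }
}
print(json.dumps(knn_query)[:80])
```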

Just curious! why does it triple the amount of ram required for search? won't the search happen in primary copies?

Christian is correct, search is provided over all primaries and replicas. Consequently, I would suggest reducing the number of replicas as apparently you don't need the additional search parallelism.

In my search, I’ve noticed that many Elasticsearch users recommend first indexing the data, followed by a force merge to a single segment, and then performing the search to get accurate benchmarks.

I wouldn't force-merge, especially with continual indexing and updating occurring. But, you can adjust your merge policy so that segments are more aggressively merged during the lifetime of the index. Here is an initial recommendation:

"merge": {
    "policy": {
        "max_merged_segment": "25gb",
        "floor_segment": "1gb",
        "segments_per_tier": 5
    }
}

This increases the max segment size, adjusts what is considered a "small" segment to a gigabyte, and reduces the segments per tier to 5 (from the default of 10).

I would keep the store preload.

Now, I have an additional question that needs clarification: By "initial queries" are you meaning only from a cold start and then all subsequent queries are fast? Or only repeated queries are fast?

You mention various latencies; I want to confirm the latencies you are experiencing for a warmed-up index. Note, "preload" is just a hint to the system and is not a guarantee that these files are loaded. However, I would adjust your "preload" to vex, veq, vem.


Thank you for the detailed explanation and recommendations!

I followed your advice and created a new index with the configurations you suggested:

"refresh_interval": "600s",
"number_of_shards": "6",
"merge": {
    "policy": {
        "segments_per_tier": "5",
        "floor_segment": "1gb",
        "max_merged_segment": "25gb"
    }
},
"store": {
    "preload": ["vex", "veq", "vem"]
},
"number_of_replicas": "1"

After setting up the index, I ran two services to test the performance:

  1. Indexing Service
  • This service indexed documents at a rate of ~1000 documents per second.
  • It started from 0 and indexed up to 5 million documents.
  2. Search Service
  • I started the search service when 3 million documents were indexed.
  • The search service is connected to Grafana to monitor query rates and response times.

Here are some additional observations:

  • While the indexing service was running, Elasticsearch segments grew from 0 to 680.
  • After shutting down the indexing service (at 15:18, as shown in the attached chart), segments started to merge gradually and decreased to around 80.
  • As segments decreased, search response times improved noticeably.
  • The lower the number of segments, the better the search response time was.

Regarding the additional question:

  • After the first query, subsequent queries were indeed faster, but the difference was not very significant.
  • Importantly, these queries were not repeated; each was unique.

If you have any further insights or suggestions to optimize this process, I’d greatly appreciate it!

Response time: (chart attached)

QPS: (chart attached)