Weird recall trend with HNSW int8 quantization: Smaller indices show worse recall

viperv · December 31, 2025, 6:38am

Hi,

I am running recall tests on indices of varying sizes (ranging from 500,000 to 4,500,000 vectors), all containing CLIP vectors. I am using HNSW (with the default parameters) with int8 quantization, and each index has a single shard.

It is worth noting that I am performing a text-to-image search: the query vectors are generated from text embeddings (using CLIP), while the vectors stored in the indices are image embeddings.

According to my tests, I am observing a counter-intuitive trend: smaller indices yield worse recall compared to larger ones.

Do you have any ideas on what could explain this behavior?

Thanks!

john-wagster · December 31, 2025, 4:03pm

My initial reaction is that it may depend on how many segments you have and on the configuration of the knn field. A smaller index should have fewer segments (a separate HNSW graph is built on each segment). But if you had force merged a larger index, for instance, its segments would be merged, potentially leading to better recall, particularly depending on the degree of oversampling. It could also be that for this dataset you need to increase m, the number of neighbors each node is connected to in the graph, so that the small dataset has the necessary connections, particularly if the 500k portion of the dataset looks very different from the remainder of the 4.5m.
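For reference, here is a rough sketch of where m would be set on the knn field, written as a Python dict that could be sent as the index mapping. The field name `image_embedding`, the helper `knn_field_mapping`, the 512 dims, and the cosine similarity are assumptions for illustration only, not taken from the original poster's setup; m=16 and ef_construction=100 are, to my knowledge, the defaults.

```python
import json


def knn_field_mapping(dims, m=16, ef_construction=100):
    """Build a mapping for one int8-quantized HNSW dense_vector field.

    `image_embedding` is a hypothetical field name. m and ef_construction
    are passed explicitly so they are easy to raise for an experiment.
    """
    return {
        "properties": {
            "image_embedding": {
                "type": "dense_vector",
                "dims": dims,
                "index": True,
                "similarity": "cosine",  # assumption: CLIP vectors compared by cosine
                "index_options": {
                    "type": "int8_hnsw",
                    "m": m,
                    "ef_construction": ef_construction,
                },
            }
        }
    }


# Assumed 512-dimensional CLIP vectors; one default mapping and one with a larger m.
default_mapping = knn_field_mapping(dims=512)
denser_mapping = knn_field_mapping(dims=512, m=32, ef_construction=200)

print(json.dumps(denser_mapping, indent=2))
```

Changing m only takes effect for newly built graphs, so comparing the two settings would mean reindexing the same 500k vectors into an index created with each mapping.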

But there are a lot of variables that could be affecting you. What version of ES are you running? Can you share the mappings (so we can see the configuration of the knn field(s)) and the query being made? It would also be interesting to see segment counts and info from _cat/segments?v=true&format=json for, say, one of the large indices and one of the small ones, to rule out merges as a source of confusion. Lastly, it would be good to understand what you specifically mean by worse recall. Can you share how you are doing the evaluation and the specific recall discrepancy between the indices? It strikes me that if your evaluation set is not adjusted appropriately for the 500k dataset, you may be counting as misses hits that only exist in the remainder of the 4.5m dataset.
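To illustrate that last point, here is a toy sketch of how the choice of ground truth changes measured recall. Everything in it is made up for illustration (the six 2-d vectors, the ids, and the helper names `exact_top_k` and `recall_at_k`); it assumes ground truth is computed by brute-force dot product and that the small index returns its exact top k.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def exact_top_k(query, vectors, k, allowed_ids=None):
    """Brute-force top-k ids by dot product, optionally restricted to allowed_ids."""
    ids = list(vectors) if allowed_ids is None else list(allowed_ids)
    ranked = sorted(ids, key=lambda i: dot(query, vectors[i]), reverse=True)
    return ranked[:k]


def recall_at_k(retrieved, relevant):
    """Fraction of the relevant ids that appear in the retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(relevant)


# Toy "full" corpus (stands in for the 4.5m index).
full_corpus = {
    0: [0.90, 0.10],
    1: [0.95, 0.00],
    2: [0.10, 0.90],
    3: [0.20, 0.80],
    4: [1.00, 0.00],
    5: [0.00, 1.00],
}
# The small index only holds a subset of the documents (stands in for the 500k index).
small_index_ids = {0, 1, 2}

query = [1.0, 0.0]
k = 2

# Pretend the small index is perfect: it returns the exact top-k within its own subset.
retrieved = exact_top_k(query, full_corpus, k, allowed_ids=small_index_ids)

# Ground truth computed over the full corpus vs. over the subset that was actually indexed.
truth_full = exact_top_k(query, full_corpus, k)
truth_subset = exact_top_k(query, full_corpus, k, allowed_ids=small_index_ids)

recall_vs_full = recall_at_k(retrieved, truth_full)
recall_vs_subset = recall_at_k(retrieved, truth_subset)

print(recall_vs_full, recall_vs_subset)  # prints: 0.5 1.0
```

Here the small index loses half its recall purely because one true neighbor (id 4) was never indexed in it, not because of HNSW or quantization. Recomputing the ground truth per index, over exactly the documents that index contains, would rule this out.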
