I am performing recall tests on indices containing CLIP vectors of varying sizes (ranging from 500,000 to 4,500,000 vectors). I am using HNSW (with the default parameters) with int8 quantization, and each index has a single shard.
It is worth noting that I am performing a text-to-image search: the query vectors are generated from text embeddings (using CLIP), while the vectors stored in the indices are image embeddings.
According to my tests, I am observing a counter-intuitive trend: smaller indices yield worse recall compared to larger ones.
Do you have any ideas on what could explain this behavior?
My initial reaction is that it may depend on how many segments you have and on the configuration of the knn field. A smaller index should have fewer segments (a separate HNSW graph is built on each segment). But if you had force merged a larger index, for instance, that would merge its segments, potentially leading to better recall, particularly depending on the degree of oversampling. It could also be that for this dataset you need to increase m, the number of neighbors each node is connected to in the graph, so that the small dataset has the necessary connections, particularly if the 500k portion of the dataset looks very different from the remainder of the 4.5m.
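To make the m suggestion concrete, here is a minimal sketch of a mapping body with a non-default m, written as a Python dict. It assumes a recent 8.x release where the `int8_hnsw` index option is available; the field name `image_embedding`, the 512 dimensions, and the cosine similarity are placeholders I made up, since your actual mapping has not been shared yet.

```python
import json


def build_knn_mapping(dims, m=16, ef_construction=100, field="image_embedding"):
    """Build a mapping body for a quantized HNSW dense_vector field.

    m=16 and ef_construction=100 are the Elasticsearch defaults; the field
    name, dims and similarity are assumptions for illustration only.
    """
    return {
        "properties": {
            field: {
                "type": "dense_vector",
                "dims": dims,
                "index": True,
                "similarity": "cosine",  # assumption: CLIP vectors compared by cosine
                "index_options": {
                    "type": "int8_hnsw",  # HNSW with int8 scalar quantization
                    "m": m,  # neighbors connected per node in the graph
                    "ef_construction": ef_construction,
                },
            }
        }
    }


# Example: double m relative to the default for the small index.
mapping = build_knn_mapping(dims=512, m=32)
body = json.dumps({"mappings": mapping}, indent=2)
```

The resulting `body` could be sent when creating a test index, so the same 500k vectors can be compared with m=16 and m=32. Changing m requires reindexing, so this is best tried on a copy.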
But there are a lot of variables that could be affecting you. What version of ES are you running? Can you share the mappings (so we can see the configuration of the knn field(s)) and the query being made? It would also be interesting to see the segment counts and info from _cat/segments?v=true&format=json for, say, one of the large indices and one of the small ones, to rule out merges as a source of confusion. Lastly, it would be good to understand what you specifically mean by worse recall. Can you share how you are doing the evaluation and the specific recall discrepancy between the indices? It strikes me that if your evaluation set is not adjusted appropriately for the 500k dataset, you may be counting as misses hits that only exist in the remainder of the 4.5m dataset.
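To illustrate the evaluation concern, here is a minimal sketch of recall@k in plain Python, where the ground truth is a brute-force search restricted to the documents that are actually in the index being tested. The function names and the toy vectors are hypothetical; this is not necessarily how your evaluation works.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(query, docs, k):
    """Brute-force ground truth over `docs`, a dict of doc id -> vector."""
    ranked = sorted(docs, key=lambda doc_id: cosine(query, docs[doc_id]), reverse=True)
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k that the approximate search returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    return len(truth.intersection(approx_ids[:k])) / len(truth)


# Toy data: the "small index" holds only a subset of the full dataset.
full = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.8, 0.2]}
small = {doc_id: full[doc_id] for doc_id in ("a", "c", "d")}
query = [1.0, 0.0]

# Pretend the small index returned its true nearest neighbours.
approx_from_small = exact_top_k(query, small, k=2)

# Scored against ground truth from the full dataset, the small index is
# penalised for "b", a document it never contained.
recall_vs_full_truth = recall_at_k(approx_from_small, exact_top_k(query, full, k=2), k=2)
recall_vs_small_truth = recall_at_k(approx_from_small, exact_top_k(query, small, k=2), k=2)
```

If the ground truth was computed once over the 4.5m vectors and reused for every index size, the smaller indices would look worse for this reason alone, independent of HNSW or quantization.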