Efficient more-like-this using vector search

Hello friends,

I am learning about semantic search and vector embeddings and it’s making me love life more every day. What a time to be alive!

I’m thinking about, and experimenting with, ways I could use this to enhance the “more-like-this” functionality of an e-commerce platform with many products (200k and growing). I believe this can help me mitigate the issue that product attributes are sometimes (often) inaccurate, incomplete or missing. We are approaching this issue from two sides: 1) improving input, and 2) improving search/filtering functionality. I think implementing vector similarity search can vastly improve the latter.

Currently I’m vectorising about 2 million images and indexing them into a separate index, which has already allowed me to create a POC for search-by-image functionality. But it’s not enough for more-like-this because the accuracy for similar products comes crashing down when there are backgrounds and complex scenes involved. So my plan is to combine image vector similarity with text vector similarity search for best results.
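For anyone curious what combining the two signals could look like: since Elasticsearch allows a top-level knn clause alongside a regular query, one way is a hybrid search body with a boost on each side. This is a minimal sketch only, and the field names (image_embedding, description), the vector dimension, and the boost weights are all assumptions, not my actual mapping:

```python
# Hypothetical hybrid search body: image-vector kNN combined with a text query.
# Field names, dimensions, and weights are placeholders/assumptions.

query_vector = [0.1] * 512  # placeholder image embedding (assumed 512 dims)

hybrid_query = {
    "knn": {
        "field": "image_embedding",   # assumed dense_vector field name
        "query_vector": query_vector,
        "k": 50,
        "num_candidates": 500,
        "boost": 0.5,                 # weight of the image-similarity signal
    },
    "query": {
        "match": {
            "description": {          # assumed text field name
                "query": "red leather handbag",
                "boost": 0.5,         # weight of the text signal
            }
        }
    },
    "size": 50,
}
```

The knn and query scores are summed per document, so the boosts control how much each side contributes to the final ranking.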

The problem I’m facing is that semantic_text and knn queries return only a small number of results. semantic_text returns 10 by default (with no option for configuring k, as far as I can find), and knn becomes increasingly slow as num_candidates and k grow. Increasing k into the hundreds does not seem feasible (though to be fair I’m testing this mostly locally and on a low-end staging environment - if I am wrong and more resources would negate the issue, please correct me). But I may need to return hundreds of results or more, if applicable. In any case I don’t want to be limited to numbers below 100.

Could anyone point me in the right direction for my research? Any tips are greatly appreciated!

Edit:
Ideally I would have no limit on the number of results returned - I’d much rather set a minimum _score value which we’d tweak experimentally. But that would mean all documents would have to be considered, which is not feasible with knn. I thought ANN might be of help here, but it turns out my dense_vector fields already have index set to true, so my knn searches are already approximate.
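On the “minimum score instead of a fixed k” idea: the knn clause accepts a similarity parameter, which is a minimum required vector similarity (in the vector space, e.g. cosine - not the final _score directly) below which candidates are dropped. A minimal sketch, assuming the same hypothetical image_embedding field and a threshold value that would have to be tuned experimentally:

```python
# Hypothetical knn clause with a similarity floor instead of relying on k alone.
# Field name, vector, and the 0.85 threshold are assumptions to be tuned.

query_vector = [0.1] * 512  # placeholder embedding

thresholded_knn = {
    "knn": {
        "field": "image_embedding",   # assumed dense_vector field
        "query_vector": query_vector,
        "k": 200,
        "num_candidates": 1000,
        "similarity": 0.85,           # drop candidates less similar than this
    }
}
```

Note that k and num_candidates still cap how many candidates are gathered; the threshold only filters within that set, so it approximates - but doesn’t fully replace - an unbounded min-score search.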

Edit 2:
One thing I didn’t mention yet: the image vector search is even more challenging because every product may have 5-15 pictures. So I have about 1.5 million documents in my product-image-embedding index right now, and the vectorisation is still running; I think I’ll end up with 2m+ documents. What I would want to do is get all matches for an input image, average the _scores of all matched images per product_id (which is stored on the document together with the embeddings) and rank the results based on those average scores. But currently I see no way of accomplishing this with such a large index.
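The per-product averaging step itself is simple to do client-side once you have the hits; the hard part is the retrieval size, not the aggregation. A sketch of the grouping/ranking logic, with hits shaped like an Elasticsearch response and a product_id field as described above:

```python
from collections import defaultdict

def rank_products_by_mean_score(hits):
    """Group image hits by product_id and rank products by mean _score.

    `hits` mimics the hits array of an Elasticsearch search response;
    the product_id source field is the one described in the post.
    """
    scores = defaultdict(list)
    for hit in hits:
        scores[hit["_source"]["product_id"]].append(hit["_score"])
    # Average per product, then sort descending by that average.
    return sorted(
        ((pid, sum(s) / len(s)) for pid, s in scores.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

hits = [
    {"_score": 0.90, "_source": {"product_id": "A"}},
    {"_score": 0.70, "_source": {"product_id": "A"}},
    {"_score": 0.85, "_source": {"product_id": "B"}},
]
# Product B (mean 0.85) ranks above product A (mean 0.80).
ranked = rank_products_by_mean_score(hits)
```

One design caveat with averaging: products whose only match is a single lucky image can outrank products with many good matches, so a count-weighted mean may be worth experimenting with too.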

Kind regards and thanks for all your work,

Martin

Welcome to the community!

You should be able to increase the size of semantic queries. However, you'll run into similar performance concerns with larger k values.
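For example, raising the search-level size on a semantic query above the default of 10 looks like this - a minimal sketch, with the semantic_text field name (description_semantic) being an assumption:

```python
# Hypothetical semantic query with an increased result size.
# The field name is an assumption about the poster's mapping.

semantic_search = {
    "size": 100,  # raise the default of 10
    "query": {
        "semantic": {
            "field": "description_semantic",  # assumed semantic_text field
            "query": "red leather handbag",
        }
    },
}
```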

It sounds like you might be interested in exploring semantic reranking - late-stage rerankers are more performant and may get you what you need. We also have a blog post on the topic that you might find helpful.

Thanks for your suggestions Kathleen! Awesome stuff! I’ll delve into them later today.