Hello friends,
I am learning about semantic search and vector embeddings and it’s making me love life more every day. What a time to be alive!
I’m thinking about, and experimenting with, ways I could implement this to enhance a “more-like-this” functionality for an e-commerce platform with many products (200k and growing). I believe this can help me mitigate the issue that product attributes are sometimes (often) inaccurate, incomplete, or missing. We are approaching this issue from two sides: 1) improving input, and 2) improving search/filtering functionality. I think implementing vector similarity search can vastly improve the latter.
Currently I’m vectorising about 2 million images and indexing them into a separate index, which has already allowed me to create a POC for search-by-image functionality. But it’s not enough for more-like-this, because accuracy for similar products comes crashing down when backgrounds and complex scenes are involved. So my plan is to combine image vector similarity with text vector similarity search for the best results.
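To be concrete, the combination I have in mind is simple late fusion: a weighted sum of the image and text similarity scores, with the weight itself tuned experimentally. A minimal sketch (the weight and scores here are made-up placeholders):

```python
def combined_score(image_score, text_score, image_weight=0.5):
    # Naive late fusion: weighted average of image and text similarity.
    # image_weight is a placeholder we'd have to tune experimentally.
    return image_weight * image_score + (1 - image_weight) * text_score

# e.g. favour the image signal 70/30
print(combined_score(0.8, 0.6, image_weight=0.7))  # ~0.74
```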
The problem I’m facing is that `semantic_text` and `knn` queries return only a small number of results. `semantic_text` returns 10 by default (with no option for configuring `k`, as far as I can find), and `knn` becomes increasingly slow as we increase `num_candidates` and `k`. Increasing `k` into the hundreds does not seem feasible (though to be fair, I’m testing this mostly locally and on a low-end staging environment; if I’m wrong and bigger resources negate my issue, please correct me). But I may need to return hundreds of results or even more, if applicable. In any case, I don’t want to be limited to numbers below 100.
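For reference, this is roughly the kind of `knn` search body I’m experimenting with; the field name `image_embedding` and the `_source` field are placeholders, not my real mapping:

```python
# Sketch of the kNN search body (field/source names are placeholders).
def build_knn_query(query_vector, k=100, num_candidates=500):
    return {
        "knn": {
            "field": "image_embedding",        # dense_vector field (placeholder name)
            "query_vector": query_vector,
            "k": k,                            # how many neighbours to return
            "num_candidates": num_candidates,  # per-shard candidate pool; raising this is what slows things down
        },
        "_source": ["product_id"],
    }

body = build_knn_query([0.1, 0.2, 0.3], k=200, num_candidates=1000)
print(body["knn"]["k"], body["knn"]["num_candidates"])  # 200 1000
```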
Could anyone point me in the right direction for my research? Any tips are greatly appreciated!
Edit: ideally I would have no limit on the number of results returned; I’d much rather set a minimum `_score` value, which we’d tweak experimentally. But that would mean all documents would have to be considered, which is not feasible with `knn`. I thought ANN might be of help here, but it turns out my `dense_vector` fields already have `index` set to `true`.
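What I’d want is roughly this behaviour, which for now I could only do client-side after the fact (hit shape assumed; the threshold is an arbitrary placeholder we’d tune experimentally):

```python
# Client-side workaround sketch: keep only hits above a minimum score.
def filter_by_min_score(hits, min_score=0.75):
    # hits: Elasticsearch-style hit dicts, each carrying a "_score" key
    return [h for h in hits if h["_score"] >= min_score]

hits = [
    {"_id": "a", "_score": 0.91},
    {"_id": "b", "_score": 0.62},
]
print(filter_by_min_score(hits))  # only "a" survives the threshold
```

Of course this doesn’t solve the real problem, since the server has already capped the result set at `k` before I ever see it.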
Edit 2: something I didn’t even discuss: the image vector search is even more challenging because every product may have 5-15 pictures. So I have about 1.5 million documents in my `product-image-embedding` index right now, and the vectorisation is still going; I think I’ll end up with 2m+ documents. What I would want to do is get all matches for an input image, average the `_score` values of all matched images per `product_id` (which is stored on the document together with the embedding), and rank the results based on those average scores. But currently I see no way of accomplishing this with such a large index.
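To illustrate, this is the aggregation I’d happily do client-side if the full match set were small enough to retrieve (hit shape assumed):

```python
from collections import defaultdict

def average_score_per_product(hits):
    # hits: Elasticsearch-style hits with _score and product_id in _source.
    # Accumulate (score_sum, count) per product_id, then rank by the average.
    totals = defaultdict(lambda: [0.0, 0])
    for h in hits:
        pid = h["_source"]["product_id"]
        totals[pid][0] += h["_score"]
        totals[pid][1] += 1
    averages = {pid: s / n for pid, (s, n) in totals.items()}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)

hits = [
    {"_score": 1.0, "_source": {"product_id": "p1"}},
    {"_score": 0.5, "_source": {"product_id": "p1"}},
    {"_score": 0.9, "_source": {"product_id": "p2"}},
]
print(average_score_per_product(hits))  # [('p2', 0.9), ('p1', 0.75)]
```

The pain point is the first step: getting *all* matches per input image out of a 2m-document index in the first place.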
Kind regards and thanks for all your work,
Martin