I am currently working with multilingual-e5-small for semantic search. However, the retrieved results often seem unrelated to the search query.
My actual question is how to evaluate the model more accurately with indexed embeddings. How is the search done with the tokens?
Interestingly, the embeddings produced during training differ from those produced at index time by the pipeline, even for the same text.
Inputs to the E5 family of models should be prefixed with either "query: " or "passage: ", because this is how the model was trained (see the FAQ on HuggingFace).
Elasticsearch automatically adds the passage: prefix to inputs as they are ingested and the query: prefix to search inputs. This would explain the different embedding values you are seeing.
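If you want to reproduce the pipeline's embeddings outside Elasticsearch, the same prefixes would have to be added by hand before the text reaches the model. A minimal sketch of that step, assuming a hypothetical helper named add_e5_prefix (it is not part of any library, and the exact prefix strings follow the examples in the model card):

```python
# Hypothetical helper, not a library function: E5 models expect each input
# to start with "query: " (search text) or "passage: " (indexed text).
E5_PREFIXES = {"query": "query: ", "passage": "passage: "}


def add_e5_prefix(text: str, kind: str) -> str:
    """Return text with the E5 prefix for the given kind of input."""
    if kind not in E5_PREFIXES:
        raise ValueError(f"kind must be one of {sorted(E5_PREFIXES)}")
    prefix = E5_PREFIXES[kind]
    # Avoid double-prefixing text that already carries the prefix.
    if text.startswith(prefix):
        return text
    return prefix + text


indexed_text = add_e5_prefix("blue cotton shirt", "passage")
search_text = add_e5_prefix("shirts", "query")
```

Embedding the unprefixed text locally and the prefixed text in the pipeline gives two different inputs to the model, so two different vectors are expected.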
How is the search done with the tokens?
Search is performed in the vector database, not on the tokens. The query text is converted to an embedding, and the vector database then finds the stored embeddings that are closest to the query embedding.
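To make "closest" concrete, here is a rough brute-force sketch using cosine similarity. The three-dimensional vectors and document names are made up for illustration; real E5 embeddings have many more dimensions, and a vector database would normally use an approximate index rather than comparing against every document.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query_vec, indexed, k=2):
    """Return the k (doc_id, similarity) pairs most similar to query_vec."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in indexed.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings, invented for this example only.
indexed = {
    "shirt": [0.9, 0.1, 0.0],
    "sneakers": [0.1, 0.9, 0.2],
    "t-shirt": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]

results = nearest(query, indexed, k=2)
```

Note that a nearest-neighbour search always returns the k closest vectors, even when none of them is actually a good match, so a low-similarity hit can still show up in the results.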
Regarding the use of the 'query' and 'passage' prefixes, I wasn't aware that Elasticsearch already adds these automatically—perhaps I missed that part of the documentation.
Now, about the issue with the vectors Elasticsearch generates for search, I have a specific scenario in mind. Take a product catalog as an example:
When searching for "shirts," it sometimes returns "sneakers"—terms that have no relation to the search. I believe this is related to the FAQ:
2. Why are my reproduced results slightly different from reported in the model card?
Different versions of transformers and pytorch could cause negligible but non-zero performance differences
Given that it will return 'random' information like this...
For textual search, Elasticsearch handles search relevance using approaches like TF-IDF, which makes it easy to identify the relevance of a document.
But with vectors, how does it calculate the score so that "sneakers" don't appear in my search for "shirts"?
How can Elasticsearch resolve this in its semantic search?