Confusing results from multilingual-e5-small

I am currently working with multilingual-e5-small for semantic search. However, the retrieved results often seem only loosely related to the search query.

My actual question is how to evaluate the model more accurately against the indexed embeddings. How is the search done with the tokens?

When comparing embeddings obtained during training with those produced at indexing time via the pipeline, they differ even for the same text, which is surprising.

With elser_v2, there is an interesting explanation of this topic for some of the raised issues (see "Improving text expansion performance using token pruning" on Elastic Search Labs, elastic.co), but for other models it is somewhat unclear.

I am not currently working with the language supported by elser_v2.


Inputs to the E5 family of models should be prefixed with either "query: " or "passage: ", because this is how the model was trained (see the FAQ on the Hugging Face model card).

Elasticsearch automatically adds the "passage: " prefix to inputs as they are ingested and the "query: " prefix to search inputs. This explains the different embedding values you are seeing.
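You can see the effect of the prefixes with a quick local check. This is a minimal sketch using the sentence-transformers package, not the exact code path Elasticsearch runs internally:

```python
# Minimal sketch: embed the same text with and without the E5 prefixes
# and compare. Assumes sentence-transformers is installed locally.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-small")

text = "blue cotton shirt"
plain = model.encode(text, normalize_embeddings=True)
passage = model.encode("passage: " + text, normalize_embeddings=True)
query = model.encode("query: " + text, normalize_embeddings=True)

# The vectors are similar but not identical, which is why embeddings
# generated without a prefix differ from what Elasticsearch indexes.
print(util.cos_sim(plain, passage))
print(util.cos_sim(query, passage))
```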

How is the search done with the tokens?

Search is performed in the vector index: the query text is converted to an embedding, and then the vector index is used to find the stored embeddings that are closest to that query embedding.
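In practice the flow looks roughly like the sketch below. The index and field names are made up for illustration, and it assumes an 8.x cluster where a dense_vector field has been populated with E5 embeddings:

```python
# Rough sketch of the query-time flow: embed the query text (with the
# "query: " prefix) and run a kNN search against the dense_vector field.
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

es = Elasticsearch("http://localhost:9200")
model = SentenceTransformer("intfloat/multilingual-e5-small")

query_vector = model.encode("query: shirts", normalize_embeddings=True)

response = es.search(
    index="products",          # hypothetical index name
    knn={
        "field": "embedding",  # hypothetical dense_vector field
        "query_vector": query_vector.tolist(),
        "k": 10,
        "num_candidates": 100,
    },
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```

The documents returned are simply the k stored vectors closest to the query vector, ranked by similarity.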

Regarding the use of the 'query' and 'passage' prefixes, I wasn't aware that Elasticsearch already adds these automatically—perhaps I missed that part of the documentation.

Now, about the issue with the vectors Elasticsearch produces for search, I have a specific scenario in mind. Take, for example, a product catalog:

When searching for "shirts", it sometimes returns "sneakers", a term that has no relation to the search. I believe this is related to this item in the model's FAQ:

2. Why are my reproduced results slightly different from reported in the model card?
Different versions of transformers and pytorch could cause negligible but non-zero performance differences

Given that it can return seemingly "random" results like this...

For lexical text search, Elasticsearch scores relevance with algorithms like BM25 (a refinement of TF-IDF), which makes it easy to understand why a document was considered relevant.

But with vectors, how does it calculate the score so that "sneakers" don't appear in my search for "shirts"?

How can Elasticsearch resolve this in its semantic search?
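To make the question concrete, here is a small local sketch of the comparison I have in mind, computed with sentence-transformers; the product texts are made-up examples, not my real index:

```python
# Compare a "shirts" query against two made-up product texts to see how
# close or far their E5 embeddings actually are.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-small")

query = model.encode("query: shirts", normalize_embeddings=True)
passages = model.encode(
    ["passage: blue cotton shirt", "passage: running sneakers"],
    normalize_embeddings=True,
)

# Cosine similarity of the query against each product text. As far as I
# understand, Elasticsearch maps cosine similarity to a score as
# (1 + cosine) / 2, and a kNN search always returns the k closest
# documents even when none of them are very close.
for label, score in zip(["shirt", "sneakers"], util.cos_sim(query, passages)[0]):
    print(label, float(score))
```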
