I'm implementing semantic search for product names, aiming to match results to user intent. However, I'm observing that the semantic search matches only fragments of the query. For example, when searching for "shirt with green tone," it often returns documents that match only the term "shirt," or even irrelevant results that match only "green," such as "Green Plant."
With textual search this is manageable, but how can I improve relevance using only semantic search?
I understand your point, but the issue is that, semantically, the results are already being returned incorrectly. When using a query that combines two search factors – textual and semantic – the results show inconsistencies.
While the combination of textual and semantic search is essential, I have observed that the semantic component misinterprets the meaning of certain words in their context. This compromises the relevance of the results.
Regarding the model, I am using it with Portuguese content, although the examples here are in English.
This also makes it impossible to get zero results. For example, when searching for an incoherent term, a textual search returns no results, whereas a semantic search always provides some semantically related random data.
I have observed that the semantic component misinterprets the meaning of certain words in their context.
This has to do with the model. I would suggest switching the model or, as I suggested already, boosting in combination with lexical results.
For example, when searching for an incoherent term, a textual search returns no results, whereas a semantic search always provides some semantically related random data.
@Shell_Dias this is just how embedding models work. They embed whatever you provide and we then return the nearest k neighbors.
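To illustrate why a nearest-neighbor search never comes back empty, here is a minimal sketch in plain Python. The three-dimensional vectors and product names are made up for the example and do not come from any real embedding model; the point is only that ranking by similarity always yields k hits, even when every score is poor.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn(query_vec, docs, k):
    """Return the k most similar (score, name) pairs, best first."""
    scored = [(cosine(query_vec, vec), name) for name, vec in docs.items()]
    scored.sort(reverse=True)
    return scored[:k]

# Hypothetical hand-made "embeddings"; a real model produces hundreds of dimensions.
DOCS = {
    "Green Shirt": [0.9, 0.8, 0.1],
    "Blue Shirt": [0.9, 0.1, 0.1],
    "Green Plant": [0.1, 0.9, 0.7],
}

# A query vector pointing away from every document, standing in for an incoherent term.
nonsense_query = [-0.3, -0.2, -0.9]
hits = knn(nonsense_query, DOCS, k=2)

for score, name in hits:
    print(f"{name}: {score:.3f}")
```

Both hits have negative similarity here, yet they are still returned, because kNN only ranks documents and has no notion of "no match."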
Another solution is to not use semantic search in this way at all, and instead search image embeddings to boost products, or run a first-phase search with BM25 only and then rescore with kNN, which ensures that only documents containing the terms you care about are returned.
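The "lexical first, then rescore" idea could look roughly like the sketch below. It is plain Python, not Elasticsearch code: the first phase is a simple "all query terms must appear" filter standing in for a BM25 query, and the vectors, product names, and function names are hypothetical. It only shows the shape of the approach.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def lexical_candidates(query, docs):
    """First phase: keep only documents containing every query term.
    This is a simplified stand-in for a BM25 query, not BM25 itself."""
    terms = set(query.lower().split())
    return [name for name in docs if terms <= set(name.lower().split())]

def search(query, query_vec, docs):
    """Second phase: reorder the lexical candidates by vector similarity."""
    candidates = lexical_candidates(query, docs)
    rescored = [(cosine(query_vec, docs[name]), name) for name in candidates]
    rescored.sort(reverse=True)
    return [name for _, name in rescored]

# Hypothetical hand-made "embeddings" keyed by product name.
DOCS = {
    "Green Shirt Slim Fit": [0.9, 0.8, 0.1],
    "Green Shirt Classic": [0.8, 0.9, 0.3],
    "Blue Shirt": [0.9, 0.1, 0.1],
    "Green Plant": [0.1, 0.9, 0.7],
}

# Assumed query embedding; in practice it would come from the same model as the documents.
query_vec = [0.9, 0.8, 0.0]

results = search("green shirt", query_vec, DOCS)
empty = search("asdfgh", query_vec, DOCS)
print(results)
print(empty)
```

Because the vector step only reorders what the lexical step already matched, "Green Plant" and "Blue Shirt" cannot appear for "green shirt", and an incoherent term yields an empty list. The trade-off is that documents which are relevant but share no terms with the query are never considered.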