[KNN] Semantics Affected by Phrase Similarity

I'm implementing a semantic search for product names, aiming to allow searches based on user intent. However, I'm observing that the semantic search matches fragments of the text rather than the phrase as a whole. For example, when searching for "shirt with green tone," it often returns documents containing only the term "shirt," or even irrelevant results containing "green," such as "Green Plant."

In a textual (lexical) search this kind of behavior is manageable, but how can I improve it using only semantic search?

          {
            "knn": {
              "field": "name_embedding",
              "num_candidates": 100,
              "query_vector_builder": {
                "text_embedding": {
                  "model_id": ".multilingual-e5-small_linux-x86_64",
                  "model_text": "shirt with green tone"
                }
              }
            }
          }

Elastic Version 8.15


@Shell_Dias I would suggest combining both lexical signals (text search) and semantic search.

Embeddings are no "silver bullet," and when it comes to simple, term-focused queries, lexical search still performs very well.
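
To illustrate, a hybrid request could look roughly like the following. This is only a sketch: the `name` text field and the 0.5 boosts are assumptions on my side, and the knn section just mirrors your original query.

    {
      "query": {
        "match": {
          "name": {
            "query": "shirt with green tone",
            "boost": 0.5
          }
        }
      },
      "knn": {
        "field": "name_embedding",
        "k": 10,
        "num_candidates": 100,
        "boost": 0.5,
        "query_vector_builder": {
          "text_embedding": {
            "model_id": ".multilingual-e5-small_linux-x86_64",
            "model_text": "shirt with green tone"
          }
        }
      }
    }

With both clauses in one request, a product whose name actually contains "shirt" gets lexical credit on top of its vector score, so neighbors that are only "green"-related (like "Green Plant") should be pushed down.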

Another option is simply trying a different model, but I am not sure what the best model would be for your use case.

I understand your point, but the issue is that the semantic results on their own are already incorrect. When I use a query that combines the two search factors (textual and semantic), the results show inconsistencies.

While the combination of textual and semantic search is essential, I have observed that the semantic component misinterprets the meaning of certain words in their context. This compromises the relevance of the results.

Regarding the model, I am using it with Portuguese-language content, although the examples here are in English.

This also affects queries that should return zero results. For example, when searching for an incoherent term, a textual search returns no results, whereas a semantic search always provides some semantically related random data.

I have observed that the semantic component misinterprets the meaning of certain words in their context.

This has to do with the model. I would suggest switching the model or, as I suggested already, boosting in combination with lexical results.

For example, when searching for an incoherent term, a textual search returns no results, whereas a semantic search always provides some semantically related random data.

@Shell_Dias this is just how embedding models work. They embed whatever you provide and we then return the nearest k neighbors.

There are ways to mitigate these behaviors; one example is requiring an overall min_score via a function_score query (see "Function score query" in the Elasticsearch Guide).

This way neighbors that are exceptionally far away are removed.
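
A rough sketch of that idea, using the knn query from the query DSL (available in recent 8.x releases) wrapped in a function_score. The weight: 1 function simply keeps the original score, and the 0.8 threshold is purely illustrative; you would tune it against the _score values of results you consider good:

    {
      "query": {
        "function_score": {
          "query": {
            "knn": {
              "field": "name_embedding",
              "num_candidates": 100,
              "query_vector_builder": {
                "text_embedding": {
                  "model_id": ".multilingual-e5-small_linux-x86_64",
                  "model_text": "shirt with green tone"
                }
              }
            }
          },
          "functions": [ { "weight": 1 } ],
          "min_score": 0.8
        }
      }
    }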

You can also filter the k neighbors directly in the kNN search itself (see "k-nearest neighbor (kNN) search" in the Elasticsearch Guide).
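
For example, the kNN section accepts a filter (restricting neighbors to documents that match a query) and a similarity threshold. Here is a sketch, where the `name` field in the filter and the 0.9 value are my assumptions:

    {
      "knn": {
        "field": "name_embedding",
        "k": 10,
        "num_candidates": 100,
        "filter": {
          "match": { "name": "shirt" }
        },
        "similarity": 0.9,
        "query_vector_builder": {
          "text_embedding": {
            "model_id": ".multilingual-e5-small_linux-x86_64",
            "model_text": "shirt with green tone"
          }
        }
      }
    }

Note that similarity is expressed in the space of the field's similarity metric (e.g. cosine), not the final _score, so the right threshold depends on your mapping and data.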

Another solution is not to use semantic search in this way at all, but instead to search image embeddings to boost products, or to run a first-phase search with BM25 only and then rescore with kNN, ensuring that only documents containing the terms you care about are returned.
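
The BM25-first-then-rescore variant could look roughly like this. Again, only a sketch: the `name` field, weights, and window size are assumptions, and the [ ... ] placeholder stands for the query embedding, which you would need to generate up front (for example by running the search text through your deployed model), since the script needs an explicit vector:

    {
      "query": {
        "match": { "name": "shirt with green tone" }
      },
      "rescore": {
        "window_size": 50,
        "query": {
          "rescore_query": {
            "script_score": {
              "query": { "match_all": {} },
              "script": {
                "source": "cosineSimilarity(params.query_vector, 'name_embedding') + 1.0",
                "params": {
                  "query_vector": [ ... ]
                }
              }
            }
          },
          "query_weight": 1.0,
          "rescore_query_weight": 2.0
        }
      }
    }

Because the rescore phase only re-ranks the top window_size BM25 hits, a document that matches none of the query terms can never surface purely on vector similarity, which addresses the "Green Plant" case.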