Filter vector search results to get only relevant documents?

I am not an expert in Elasticsearch queries, and I have not found a way to filter my results. My index contains 450,000 documents. The issue is that every search returns all 450,000 documents, sorted by relevance, and when I examine the last results, some of them do not match my query at all. I am therefore considering keeping only the documents with a similarity score greater than 0.5, so that only relevant results come back.

Mapping of the field:
"vector": {
    "type": "dense_vector",
    "dims": 768,
    "index": false
},

My query looks like this:


query["bool"]["should"].append({
    "script_score": {
        "query": {"match_all": {}},
        "script": {
            "source": "(1.0 + cosineSimilarity(params['query_vector'], 'vector'))",
            "params": {"query_vector": vector_field}
        }
    }
})

I want to get only the docs that have:

cosineSimilarity(params['query_vector'], 'vector') > 0.5

Heya @john_nicolas,

A thing to try would be using the min_score parameter (see the Search API page of the Elasticsearch Guide [8.8]).

This requires documents to have at least this score if they are going to be included in the result set.
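
As a rough sketch of where that parameter goes, assuming the request body is built as a Python dict like the rest of this thread (the helper name `build_search_body` and the example values are hypothetical): min_score sits at the top level of the search body, next to the query, so it is compared against the final score of each hit.

```python
def build_search_body(query, min_score, size=10):
    """Wrap a query in a search body with a top-level min_score cutoff.

    Hypothetical helper for illustration; only the body layout matters.
    """
    return {
        "size": size,
        "min_score": min_score,  # hits whose final _score is below this are dropped
        "query": query,
    }


# Placeholder query; in practice this would be the full bool query.
body = build_search_body({"match_all": {}}, min_score=1.5)
```

Because the cutoff applies to the whole query, it cannot distinguish between the scores contributed by individual clauses.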

OK, I tried it, but it doesn't work, or I don't know how to use it.

The problem is that I use a combined query that mixes a simple text search with the vector search, like this:

query["bool"]["should"].append({
    "multi_match": {
        "query": value,
        "fields": ["title_txt_ml", "content_txt_ml"],
        "operator": "and",
        "boost": 80.0
    }
})
query["bool"]["should"].append({
    "multi_match": {
        "query": value,
        "fields": ["title_txt_ml", "content_txt_ml"],
        "operator": "and",
        "type": "phrase",
        "slop": 1,
        "boost": 20.0
    }
})
query["bool"]["should"].append({
    "script_score": {
        "query": {"match_all": {}},
        "script": {
            "source": "(1.0 + cosineSimilarity(params['query_vector'], 'vector'))",
            "params": {"query_vector": vector_field}
        },
        "boost": 0.5
    }
})

So I cannot define a single min_score for the relevance of the docs as a whole. I think the solution is to filter only the results of the vector search, i.e. in this block:

query["bool"]["should"].append({
    "script_score": {
        "query": {"match_all": {}},
        "script": {
            "source": "(1.0 + cosineSimilarity(params['query_vector'], 'vector'))",
            "params": {"query_vector": vector_field}
        },
        "boost": 0.5
    }
})

Ah, you only want min_score for vectors but continue to get the docs for other matching queries.

In newer versions, knn has this capability (see k-nearest neighbor (kNN) search | Elasticsearch Guide [8.9] | Elastic): it can filter on the calculated similarity of the document.

But, I will have to spend some time thinking on how to do this with brute-force knn.

We want to make all these interactions cleaner, the API will improve :).

Hey @john_nicolas you mentioned you tried min_score, but did you try min_score within the script_score?

{
    "script_score": {
        "query": {"match_all": {}},
        "script": {
            "source": "(1.0 + cosineSimilarity(params['query_vector'], 'vector'))",
            "params": {"query_vector": vector_field}
        },
        "min_score": 0.5,
        "boost": 0.5
    }
}
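
One thing to watch with this snippet: the script returns 1.0 + cosineSimilarity, so its scores range from 0 to 2, and a cosine threshold has to be shifted by the same 1.0 before it is used as min_score. A minimal sketch, assuming the script source from this thread (the helper name `vector_clause` and the three-element placeholder vector are hypothetical, and I have not checked how boost interacts with the cutoff):

```python
SCRIPT_OFFSET = 1.0  # the script adds 1.0 so that scores are never negative


def vector_clause(query_vector, min_cosine, boost=0.5):
    """Build a script_score clause that drops docs below a cosine threshold."""
    return {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "(1.0 + cosineSimilarity(params['query_vector'], 'vector'))",
                "params": {"query_vector": query_vector},
            },
            # Threshold expressed on the script's own scale: offset + cosine.
            "min_score": SCRIPT_OFFSET + min_cosine,
            "boost": boost,
        }
    }


# Placeholder vector; the real field in this thread has 768 dimensions.
clause = vector_clause([0.1, 0.2, 0.3], min_cosine=0.5)
print(clause["script_score"]["min_score"])  # 1.5
```

On that scale, a min_score of 0.5 would only exclude docs whose cosine similarity is below -0.5, which is probably looser than intended.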

Something that may be of help in deciding what score threshold to pick:

If you have many category fields (e.g. department:kitchen utensils), you can use these to see which score ranges produce a bewildering number of values (indicating a random rather than cohesive set of concepts).
In this visualisation, the query score is on the x-axis and the breakdown of document categories matching that score band is in the vertical bar. The high-scoring documents come from a small selection of related categories, while the lower-scoring documents come from a huge number of unrelated categories. In a way, the number of categories provides a measure of how many different meanings the matches in a range have, and gives a reasonable indication of which score bands contain meaningless results.
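
The same idea can be roughed out offline. This is only a sketch, assuming you have already collected (score, category) pairs from a search response; the function name `categories_per_band`, the band width, and the sample data are all made up for illustration.

```python
from collections import defaultdict


def categories_per_band(hits, band_width=0.5):
    """Count distinct categories among the hits in each score band.

    hits: iterable of (score, category) pairs.
    Returns {band_lower_bound: number_of_distinct_categories}.
    """
    bands = defaultdict(set)
    for score, category in hits:
        # Integer band index, so that dict keys are not raw floats.
        index = int(score // band_width)
        bands[index].add(category)
    return {round(i * band_width, 10): len(cats) for i, cats in sorted(bands.items())}


# Invented sample: high scores cluster in related categories,
# low scores scatter across unrelated ones.
hits = [
    (1.9, "kitchen utensils"),
    (1.8, "kitchen utensils"),
    (1.7, "cookware"),
    (0.7, "garden"),
    (0.6, "toys"),
    (0.6, "automotive"),
    (0.5, "books"),
]
print(categories_per_band(hits))  # {0.5: 4, 1.5: 2}
```

A band where the distinct-category count jumps sharply is a candidate for the score threshold.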

Exactly, I want min_score only for the vectors, while still getting the docs from the other matching queries.
I changed my method:

"knnvector": {
    "type": "elastiknn_dense_float_vector",
    "elastiknn": {
        "dims": 768,
        "model": "lsh",
        "similarity": "angular",
        "L": 99,
        "k": 1
    }
}

I tried putting min_score in the query, but it does not work for me:

payload['query'] = {
    "elastiknn_nearest_neighbors": {
        "field": "knnvector",
        "vec": {
            "values": vector_field,
        },
        "model": "lsh",
        "similarity": "angular",
        "candidates": 5000,
        "k": 5,
        "min_score": 1.5
    }
}

elastiknn_dense_float_vector isn't officially supported by Elasticsearch, and I am not even sure it's currently maintained.

If you want approximate nearest neighbors, you should try dense_vector with index: true. That allows you to set expected similarity thresholds at query time.
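
A minimal sketch of what that could look like, assuming Elasticsearch 8.8 or later and, as I understand the docs, a knn option that takes `num_candidates` and an optional `similarity` (the helper name `knn_section` and the placeholder values are hypothetical):

```python
# Mapping sketch: the vector field must be indexed for approximate kNN.
mapping = {
    "vector": {
        "type": "dense_vector",
        "dims": 768,
        "index": True,
        "similarity": "cosine",
    }
}


def knn_section(field, query_vector, k=5, num_candidates=100, similarity=None):
    """Build the top-level knn part of a search body (sketch only)."""
    section = {
        "field": field,
        "query_vector": query_vector,
        "k": k,
        "num_candidates": num_candidates,
    }
    if similarity is not None:
        # Minimum vector similarity; for cosine this is the raw cosine value,
        # not the transformed _score.
        section["similarity"] = similarity
    return section


# Placeholder vector; a real query vector needs 768 dimensions here.
search_body = {
    "query": {"match_all": {}},  # stand-in for the text queries
    "knn": knn_section("vector", [0.1, 0.2, 0.3], similarity=0.5),
}
```

The knn section sits beside the query rather than inside it, so the threshold only restricts the vector matches and leaves the text matches alone.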

Or you can continue to use script_score and set min_score for that specific query.

The min_score for the specific query doesn't work.
I tried this, but it does not work for me:

"script_score": {
    "query": {"match_all": {}},
    "script": {
        "source": "(1.0 + cosineSimilarity(params['query_vector'], 'vector'))",
        "params": {"query_vector": vector_field}
    },
    "min_score": 1.5,
}

OK, so finally I used:

      "esvector": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine"
      },

I tried this query:

payload['query'] = my_complex_queries
payload['knn'] = {
    "field": "esvector",
    "query_vector": vector_field,
    "candidates": 100,
    "k": 5,
    "similarity": 0.5
}

But it doesn't work. Can we pass a similarity threshold like this?

@john_nicolas

Could you clarify "doesn't work"? Are you getting errors? Unexpected, irrelevant results?

Could you also let us know your Elasticsearch version?

I am surprised that the min_score within the script_score didn't work either.
