Vectorised search in ES returning different results when I change the result size

My query looks like the one below. I am passing the size parameter, and if I change the size from 30 to 300 I see completely different results.

{
    "query":
    {
        "bool":
        {
            "must":
            [
                {
                    "semantic":
                    {
                        "field": "SUMMARY_EMBEDDINGS",
                        "query": "dummy query"
                    }
                }
            ]
        }
    },
    "size": 30
}

hi @Gaurav_Duseja can you provide a little more information? What do your mappings look like, how much data have you loaded, and what version of ES are you using? If you can provide an example of a loaded document and a query, that would be valuable too.

Without any other information, I know that the semantic query is backed by an ANN lookup. The size here, I think, informs the defaults used when querying; specifically, I believe it impacts the number of candidates explored during the ANN search. Depending on your data and how it is laid out, that could generate very different results simply because you haven't gathered enough candidates while exploring the space. So if, for instance, size 30 gives bad results and size 300 gives good results, I would expect this is the problem you are running into.
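One way to sanity-check that (just a sketch; I haven't run this against your index, and how much vector-search detail shows up in the profile output varies by version) is to issue the same request with "profile": true and compare the two sizes:

```json
{
    "profile": true,
    "size": 30,
    "query": {
        "semantic": {
            "field": "SUMMARY_EMBEDDINGS",
            "query": "dummy query"
        }
    }
}
```

Running this once with "size": 30 and once with "size": 300 and diffing the profiled breakdowns should at least show whether a meaningfully different amount of work is being done under the hood.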

I believe, though I haven't tried this myself yet, that you can run a knn query against the semantic field you've defined, as outlined here in the docs: Semantic query | Elasticsearch Guide [8.17] | Elastic. With a knn query you have more control over the num_candidates and k parameters, so you can experiment and see if that helps provide more consistency based on the data you have. Semantic search is evolving and will likely expose some of these knobs directly in the future, but I'm not sure when.
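For example, something roughly like this (a sketch only; the nested path is my guess at how semantic_text lays out chunked embeddings, and you'd substitute your own inference endpoint id for azure_openai_embeddings):

```json
{
    "size": 30,
    "query": {
        "nested": {
            "path": "SUMMARY_EMBEDDINGS.inference.chunks",
            "query": {
                "knn": {
                    "field": "SUMMARY_EMBEDDINGS.inference.chunks.embeddings",
                    "k": 30,
                    "num_candidates": 300,
                    "query_vector_builder": {
                        "text_embedding": {
                            "model_id": "azure_openai_embeddings",
                            "model_text": "dummy query"
                        }
                    }
                }
            }
        }
    }
}
```

The idea is to hold size fixed and vary num_candidates independently, so you can see how much of the result churn comes from candidate exploration rather than from the size parameter itself.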

This blog may also help if you want to explore a bit: Elasticsearch new semantic_text mapping: Simplifying semantic search - Elasticsearch Labs

I'm not entirely sure what license you have, but if I remember correctly a Platinum or Enterprise license is needed to use semantic search, so you may find you get better support by opening an issue.

hi @john-wagster thanks for the quick response.
I am using ES version 8.15.3, and the index holds around 1.1 lakh (~110,000) entries.
Below is the mapping for my embeddings:

"SUMMARY_EMBEDDINGS": {
                    "type": "semantic_text",
                    "inference_id": "azure_openai_embeddings",
                    "model_settings": {
                        "task_type": "text_embedding",
                        "dimensions": 3072,
                        "similarity": "dot_product",
                        "element_type": "float"
                    }
                }

I think this is happening because of the kNN search: I noticed that whenever I increase the size by 10, the value of k increases by 15, and the results get better.

Makes sense. I would definitely try a knn query and see if you can tune the num_candidates and k values to get the results you are expecting. Let me know if you do and what you are seeing, and I can try to provide some guidance based on the results. You might consider using that approach for now; it kind of depends on your objectives for the project. I'm fully expecting semantic queries to continue to evolve here.

This query gives more relevant results when I increase num_candidates:

"query": {
        "bool": {
            "should": [
                {
                    "nested": {
                        "path": "SUMMARY_EMBEDDINGS.inference.chunks",
                        "query": {
                            "knn": {
                                "k": 10,
                                "num_candidates": 1000,
                                "field": "SUMMARY_EMBEDDINGS.inference.chunks.embeddings",
                                "query_vector_builder": {
                                    "text_embedding": {
                                        "model_id": "azure_openai_embeddings",
                                        "model_text": "dummy query"
                                    }
                                },
                                "boost": 10
                            }
                        }
                    }
                }
            ]
        }
    }

Makes a lot of sense. It really depends on the dataset and on the model used to embed the data within the embedding space. If your data is very similar, you may find that you need a lot of candidates, because internally we have to explore a significantly larger portion of the underlying HNSW graph. 1000 candidates seems high-ish to me just to get the best top 10, but it's definitely not crazy. To me this reflects the current state of the research: it's difficult to ascertain whether a model will generate good embeddings for your data without being an expert in the space. You might try a few other models and see if you get less expensive results (a better top 10 without so many candidates).