Elasticsearch approximate kNN search with a custom score calculation using a custom priority field

Hi, I use Elastic Enterprise Search and ingest documents using the web crawler. The data is pre-processed by an inference pipeline that creates vectors for the title and meta_description fields before the docs are ingested into the indices. I also assign a field named "priority" to all documents based on the page URL (crawled by the web crawler).
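For context, the inference pipeline is roughly of this shape (a sketch only; the pipeline name is a placeholder, and the output fields mirror the vector field names used in the queries below):

```json
PUT _ingest/pipeline/web-crawler-inference
{
  "processors": [
    {
      "inference": {
        "model_id": "multilingual-e5-small",
        "input_output": [
          {
            "input_field": "title",
            "output_field": "ml.inference.vector_title.predicted_value"
          },
          {
            "input_field": "meta_description",
            "output_field": "ml.inference.vector_meta_description.predicted_value"
          }
        ]
      }
    }
  ]
}
```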
Things were fine for English markets when using ELSER embeddings + script_score to customize the document scores, but now that we have started working with non-English locales and creating embeddings with the E5 model, I'm struggling to find an example of running an approximate kNN search while also customizing the score values with a script_score. Also note that I cannot use RRF either, since I need highlighted fields.
Below is the score calculation used for English markets with ELSER embeddings:

"query": {
        "script_score": {
            "query": {
                "bool": {
                    "should": [
                        {
                            "multi_match": {
                                "query": {{query_string_temp}},
                                "fields": [
                                    "body_content^3.0",
                                    "headings^4.0",
                                    "meta_description^4.0",
                                    "meta_keywords.text^4.0",
                                    "title^5.0"
                                ]
                            }
                        },
                        {
                            "text_expansion": {
                                "ml.inference.title_expanded.predicted_value": {
                                    "model_text": {{query_string_temp}},
                                    "model_id": ".elser_model_2_linux-x86_64",
                                    "boost": 5.0
                                }
                            }
                        },
                        {
                            "text_expansion": {
                                "ml.inference.meta_description_expanded.predicted_value": {
                                    "model_text": {{query_string_temp}},
                                    "model_id": ".elser_model_2_linux-x86_64",
                                    "boost": 4.0
                                }
                            }
                        },
                        {
                            "text_expansion": {
                                "ml.inference.meta_keywords_expanded.predicted_value": {
                                    "model_text": {{query_string_temp}},
                                    "model_id": ".elser_model_2_linux-x86_64",
                                    "boost": 4.0
                                }
                            }
                        },
                        {
                            "text_expansion": {
                                "ml.inference.headings_expanded.predicted_value": {
                                    "model_text": {{query_string_temp}},
                                    "model_id": ".elser_model_2_linux-x86_64",
                                    "boost": 4.0
                                }
                            }
                        },
                        {
                            "bool": {
                                "boost": 1.0
                            }
                        },
                        {
                            "bool": {
                                "boost": 1.0
                            }
                        }
                    ],
                    "minimum_should_match": "1",
                    "boost": 1.0
                }
            },
            "script": {
                "source": "if(doc['priority'].size() > 0){ return _score*(1-(doc['priority'].value-1)/10) }",
                "lang": "painless"
            }
        }
    },
    "min_score": 50.0

Hey @ravi_kumar_mvs

You should be able to use the knn query (Knn query | Elasticsearch Guide [8.12] | Elastic); since it's just a query, it can be part of a script_score query.

Now, this doesn't have the query_vector_builder logic in it yet like the regular top-level knn object does. But that was recently merged, so it should be in 8.14.


Hi BenTrent,
Thank you for your time and reply.
Yes, I have tried that, and it works when I don't use script_score or function_score queries to modify the score.
For example, the below query works:

{
    "query": {
        "multi_match": {
            "query": "{{query_string_temp}}",
            "fields": [
                "body_content^3.0",
                "headings^4.0",
                "meta_description^4.0",
                "meta_keywords.text^4.0",
                "title^5.0"
            ]
        }
    },
    "knn": [
        {
            "field": "ml.inference.vector_title.predicted_value",
            "query_vector_builder": {
                "text_embedding": {
                    "model_id": "multilingual-e5-small",
                    "model_text": "{{query_string_temp}}"
                }
            },
            "k": 5,
            "num_candidates": 100,
            "boost":5
        },
        {
            "field": "ml.inference.vector_headings.predicted_value",
            "query_vector_builder": {
                "text_embedding": {
                    "model_id": "multilingual-e5-small",
                    "model_text": "{{query_string_temp}}"
                }
            },
            "k": 5,
            "num_candidates": 100,
            "boost":4
        },
        {
            "field": "ml.inference.vector_meta_description.predicted_value",
            "query_vector_builder": {
                "text_embedding": {
                    "model_id": "multilingual-e5-small",
                    "model_text": "{{query_string_temp}}"
                }
            },
            "k": 5,
            "num_candidates": 100,
            "boost":4
        },
        {
            "field": "ml.inference.vector_meta_keywords.predicted_value",
            "query_vector_builder": {
                "text_embedding": {
                    "model_id": "multilingual-e5-small",
                    "model_text": "{{query_string_temp}}"
                }
            },
            "k": 5,
            "num_candidates": 100,
            "boost":4
        }
    ]
}

But this one doesn't work:

{
    "from": 0,
    "size": 10,
    "query": {
        "script_score": {
            "query": {
                "bool": {
                    "should": [
                        {
                            "multi_match": {
                                "query": "{{query_string_temp}}",
                                "fields": [
                                    "body_content^3.0",
                                    "headings^4.0",
                                    "meta_description^4.0",
                                    "meta_keywords.text^4.0",
                                    "title^5.0"
                                ]
                            }
                        },
                        {
                            "knn": [
                                {
                                    "field": "ml.inference.vector_title.predicted_value",
                                    "query_vector_builder": {
                                        "text_embedding": {
                                            "model_id": "multilingual-e5-small",
                                            "model_text": "{{query_string_temp}}"
                                        }
                                    },
                                    "k": 5,
                                    "num_candidates": 100
                                },
                                {
                                    "field": "ml.inference.vector_headings.predicted_value",
                                    "query_vector_builder": {
                                        "text_embedding": {
                                            "model_id": "multilingual-e5-small",
                                            "model_text": "{{query_string_temp}}"
                                        }
                                    },
                                    "k": 5,
                                    "num_candidates": 100
                                },
                                {
                                    "field": "ml.inference.vector_meta_description.predicted_value",
                                    "query_vector_builder": {
                                        "text_embedding": {
                                            "model_id": "multilingual-e5-small",
                                            "model_text": "{{query_string_temp}}"
                                        }
                                    },
                                    "k": 5,
                                    "num_candidates": 100
                                },
                                {
                                    "field": "ml.inference.vector_meta_keywords.predicted_value",
                                    "query_vector_builder": {
                                        "text_embedding": {
                                            "model_id": "multilingual-e5-small",
                                            "model_text": "{{query_string_temp}}"
                                        }
                                    },
                                    "k": 5,
                                    "num_candidates": 100
                                }
                            ]
                        }
                    ]
                }
            },
            "script": {
                "source": "if(doc['priority'].size() > 0){ return _score*(1-(doc['priority'].value-1)/10) }",
                "lang": "painless"
            }
        }
    },
    "min_score": 50.0
}

Also, my Elasticsearch Cloud cluster is on 8.11.4.

If I understand your comment correctly, once top-level-style "knn" queries are supported (in 8.14) I'll be able to use them inside script_score or function_score, correct? If so, are there any alternatives for customizing the score at the moment?

Ah, what I am talking about is this:
Knn query | Elasticsearch Guide [8.12] | Elastic

Which is available in 8.12+.

But, you are using the query_vector_builder interface as well, which isn't available for the query until 8.14.

So, currently, there isn't a way to do what you want to do exactly until 8.14 is released.

What your query would look like in 8.14:

{
    "from": 0,
    "size": 10,
    "query": {
        "script_score": {
            "query": {
                "bool": {
                    "should": [
                        {
                            "multi_match": {
                                "query": "{{query_string_temp}}",
                                "fields": [
                                    "body_content^3.0",
                                    "headings^4.0",
                                    "meta_description^4.0",
                                    "meta_keywords.text^4.0",
                                    "title^5.0"
                                ]
                            }
                        },
                        {
                            "knn": {
                                "field": "ml.inference.vector_title.predicted_value",
                                "query_vector_builder": {
                                    "text_embedding": {
                                        "model_id": "multilingual-e5-small",
                                        "model_text": "{{query_string_temp}}"
                                    }
                                },
                                "num_candidates": 100
                            }
                        },
                        {
                            "knn": {
                                "field": "ml.inference.vector_headings.predicted_value",
                                "query_vector_builder": {
                                    "text_embedding": {
                                        "model_id": "multilingual-e5-small",
                                        "model_text": "{{query_string_temp}}"
                                    }
                                },
                                "num_candidates": 100
                            }
                        },
                        {
                            "knn": {
                                "field": "ml.inference.vector_meta_description.predicted_value",
                                "query_vector_builder": {
                                    "text_embedding": {
                                        "model_id": "multilingual-e5-small",
                                        "model_text": "{{query_string_temp}}"
                                    }
                                },
                                "num_candidates": 100
                            }
                        },
                        {
                            "knn": {
                                "field": "ml.inference.vector_meta_keywords.predicted_value",
                                "query_vector_builder": {
                                    "text_embedding": {
                                        "model_id": "multilingual-e5-small",
                                        "model_text": "{{query_string_temp}}"
                                    }
                                },
                                "num_candidates": 100
                            }
                        }
                    ]
                }
            },
            "script": {
                "source": "if (doc['priority'].size() > 0) { return _score * (1 - (doc['priority'].value - 1) / 10.0); } return _score;",
                "lang": "painless"
            }
        }
    },
    "min_score": 50.0
}

You could do this in 8.12 if you replace the "query_vector_builder" entries with "query_vector" and the already-embedded query vector.

Really appreciate your help.

So, as I understand it, the best option for now is to make two network calls: one call to fetch the query embeddings and a second call to run the actual query.
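Sketched below is the two-call flow we have in mind, as Python request builders. The `/_ml/trained_models/<model_id>/_infer` path is the trained-model inference API; the parameter names and helper functions are our own placeholders:

```python
# Sketch of the two-call flow: first fetch the query embedding from the
# trained model, then feed that vector into the search template request.

def build_infer_request(model_id, query_text):
    """Call 1: path and body for POST /_ml/trained_models/<model_id>/_infer,
    whose response contains the embedding for query_text."""
    path = f"/_ml/trained_models/{model_id}/_infer"
    body = {"docs": [{"text_field": query_text}]}
    return path, body


def build_search_params(query_text, query_vector):
    """Call 2: params for the search template, passing the vector
    obtained from call 1 as query_string_temp_vectors."""
    return {
        "query_string_temp": query_text,
        "query_string_temp_vectors": query_vector,
    }
```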

Hi BenTrent,

Unfortunately, even query_vector queries are not working with 8.11.4. I have tried the query below, which worked:

{
    "query": {
        "multi_match": {
            "query": "{{query_string_temp}}",
            "fields": [
                "body_content^3.0",
                "headings^4.0",
                "meta_description^4.0",
                "meta_keywords.text^4.0",
                "title^5.0"
            ]
        }
    },
    "knn": [
        {
            "field": "ml.inference.vector_title.predicted_value",
            "query_vector": {{query_string_temp_vectors}},
            "k": 5,
            "num_candidates": 100
        },
        {
            "field": "ml.inference.vector_headings.predicted_value",
            "query_vector": {{query_string_temp_vectors}},
            "k": 5,
            "num_candidates": 100
        },
        {
            "field": "ml.inference.vector_meta_description.predicted_value",
            "query_vector": {{query_string_temp_vectors}},
            "k": 5,
            "num_candidates": 100
        },
        {
            "field": "ml.inference.vector_meta_keywords.predicted_value",
            "query_vector": {{query_string_temp_vectors}},
            "k": 5,
            "num_candidates": 100
        }
    ]
}

But the below still fails:

"query": {
        "script_score": {
            "query": {
                "bool": {
                    "should": [
                        {
                            "multi_match": {
                                "query": "{{query_string_temp}}",
                                "fields": [
                                    "body_content^3.0",
                                    "headings^4.0",
                                    "meta_description^4.0",
                                    "meta_keywords.text^4.0",
                                    "title^5.0"
                                ]
                            }
                        },
                        {
                            "knn": [
                                {
                                    "field": "ml.inference.vector_title.predicted_value",
                                    "query_vector": {{query_string_temp_vectors}},
                                    "k": 5,
                                    "num_candidates": 100
                                },
                                {
                                    "field": "ml.inference.vector_headings.predicted_value",
                                    "query_vector": {{query_string_temp_vectors}},
                                    "k": 5,
                                    "num_candidates": 100
                                },
                                {
                                    "field": "ml.inference.vector_meta_description.predicted_value",
                                    "query_vector": {{query_string_temp_vectors}},
                                    "k": 5,
                                    "num_candidates": 100
                                },
                                {
                                    "field": "ml.inference.vector_meta_keywords.predicted_value",
                                    "query_vector": {{query_string_temp_vectors}},
                                    "k": 5,
                                    "num_candidates": 100
                                }
                            ]
                        }
                    ]
                }
            },
            "script": {
                "source": "if(doc['priority'].size() > 0){ return _score*(1-(doc['priority'].value-1)/10) }",
                "lang": "painless"
            }
        }
    },
    "min_score": 50.0

{{query_string_temp_vectors}} is the actual vector with the same dims as the query model expects. [0.04206765815615654,-0.03618289530277252,..etc]

Error message: "reason": "[knn] query malformed, no start_object after query name",

Correct, the knn query in the query clause is only available in 8.12+.


Hi BenTrent,
We have upgraded to 8.12.2 and lost all our web crawler indices; we will investigate why we lost them.
Also, unfortunately, 8.12.2 still doesn't support an array of knn queries with query_vector.
For example, below query works:

{
  "size" : 3,
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": {
              "query": "{{query_string_temp}}",
              "boost": 1
            }
          }
        },
        {
          "knn": {
            "field": "ml.inference.vector_title.predicted_value",
            "query_vector": {{query_string_temp_vectors}},
            "num_candidates": 10,
            "boost": 1
          }
        }
      ]
    }
  },
  "_source":["title"]
}

But not the below one:

{
    "size": 3,
    "query": {
        "bool": {
            "should": [
                {
                    "match": {
                        "title": {
                            "query": "{{query_string_temp}}",
                            "boost": 1
                        }
                    }
                },
                {
                    "knn": [
                        {
                            "field": "ml.inference.vector_title.predicted_value",
                            "query_vector": {{query_string_temp_vectors}},
                            "num_candidates": 10,
                            "boost": 1
                        },
                        {
                            "field": "ml.inference.vector_headings.predicted_value",
                            "query_vector": {{query_string_temp_vectors}},
                            "k": 5,
                            "num_candidates": 100
                        }
                    ]
                }
            ]
        }
    },
    "_source": [
        "title"
    ]
}

Error message:

"error": {
        "root_cause": [
            {
                "type": "parsing_exception",
                "reason": "[knn] query malformed, no start_object after query name",
                "line": 15,
                "col": 28
            }
        ],
        "type": "x_content_parse_exception",
        "reason": "[15:28] [bool] failed to parse field [should]",
        "caused_by": {
            "type": "parsing_exception",
            "reason": "[knn] query malformed, no start_object after query name",
            "line": 15,
            "col": 28
        }
    },
    "status": 400

Looks like I'm blocked. Any suggestions?

Sorry about your difficulties.

Each knn query is an individual query. See my example (shortened for brevity):

{
    "from": 0,
    "size": 10,
    "query": {
        "script_score": {
            "query": {
                "bool": {
                    "should": [
                        {
                            "knn": {
                                "field": "ml.inference.vector_title.predicted_value",
                                "query_vector": [1,2,3],
                                "num_candidates": 100
                            }
                        },
                        {
                            "knn": {
                                "field": "ml.inference.vector_headings.predicted_value",
                                "query_vector": [1,2,3],
                                "num_candidates": 100
                            }
                        },
                        ...
                    ]
                }
            },
            "script": {
                "source": "if (doc['priority'].size() > 0) { return _score * (1 - (doc['priority'].value - 1) / 10.0); } return _score;",
                "lang": "painless"
            }
        }
    },
    "min_score": 50.0
}

You can see many examples on this doc page: Knn query | Elasticsearch Guide [8.12] | Elastic


Thank you. We will try this with two network calls.
Is there any possibility of achieving this in one Elasticsearch API call using 8.12.2? Maybe Painless scripts or some other way?

Not in 8.12. In 8.14 the knn query will be brought into parity with the top level knn object.

Oh, OK. Maybe we are too early to this. We will then use the ingest pipeline _simulate endpoint to generate the embeddings for the user query, and then pass the embeddings to the actual search template. Looks like that's the only option for now to execute the whole flow within the Elastic platform.
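A sketch of that _simulate round trip in Python (the pipeline name, the field we feed the query text into, and the response-parsing path are assumptions based on our pipeline config):

```python
def build_simulate_request(pipeline_id, query_text):
    """Path and body for POST /_ingest/pipeline/<pipeline_id>/_simulate:
    run the user's query text through the same inference pipeline used
    at ingest time, so the embedding matches the indexed vectors."""
    path = f"/_ingest/pipeline/{pipeline_id}/_simulate"
    body = {"docs": [{"_source": {"title": query_text}}]}
    return path, body


def extract_vector(simulate_response):
    """Pull the embedding out of the _simulate response; the exact
    field path depends on the pipeline's output_field setting."""
    source = simulate_response["docs"][0]["doc"]["_source"]
    return source["ml"]["inference"]["vector_title"]["predicted_value"]
```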

Also, is there any estimated date for the 8.14 release on Elastic Cloud?

We don't have any estimates on its release date.