Is it possible to do highlighting with knn and/or hybrid search?

Welcome!

I don't think you can with kNN alone, but it should be possible with hybrid search (on the classic text queries only).

Thanks @dadoonet. I tried it with no success. Can you share sample code?

Let's do the opposite: you create a full example that does not work for you, and we iterate from there.

I'd love to see a full recreation script as described in About the Elasticsearch category. It will help us better understand what you are doing. Please try to keep the example as simple as possible.

A full reproduction script is something anyone can copy and paste into the Kibana Dev Console and run to reproduce your use case. It will help readers understand, reproduce and, if needed, fix your problem. It will also most likely get you a faster answer.

Have a look at the Elastic Stack and Solutions Help · Forums and Slack | Elastic page. It also contains a lot of useful information on how to ask for help.

import os
from elasticsearch import Elasticsearch
from openai import OpenAI

# Define Elasticsearch connection
es = Elasticsearch(
    os.environ["ELASTICSEARCH_HOST"],
    # basic_auth replaces the deprecated http_auth parameter in the 8.x client
    basic_auth=(os.environ["ELASTIC_USERNAME"], os.environ["ELASTIC_PASSWORD"]),
    verify_certs=False,
)

# OpenAI client used by get_embedding(); it reads OPENAI_API_KEY from the environment
client = OpenAI()


def get_embedding(text, model="text-embedding-3-large", dimensions=1024):
    # OpenAI's "text-embedding-3-large" model returns 3072 dimensions by default;
    # the dimensions parameter requests a shorter vector (1024 here).
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model, dimensions=dimensions).data[0].embedding

index = "poc_index"


from IPython.display import display, HTML
import re  # Regular expressions


def search_es_with_highlighting_and_slides(
    query,
    k=5,
    es=es,
    index=index,
    _source=["file_metadata.filename", "client_name", "project_name", "alltexts"],
):
    # Define search query with highlighting
    search_query = {
        "query": {
            "match": {"alltexts": {"query": query, "minimum_should_match": "75%"}}
        },
        "knn": {
            "field": "embedding",
            "query_vector": get_embedding(query),
            "k": 10,
            "num_candidates": 10,
            "boost": 1,
        },
        "highlight": {
            "fields": {"alltexts": {"fragment_size": 100, "number_of_fragments": 10}},
            "pre_tags": ["<mark>"],
            "post_tags": ["</mark>"],
        },
        "rank": {
            "rrf": {"window_size": 10},
        },
        "_source": _source,
        "size": k,
    }

    search_results = es.search(index=index, body=search_query)

    return search_results["hits"]["hits"]


res = search_es_with_highlighting_and_slides(
    "Find examples of 'digital maturity model' and how to define one", k=5
)
print(res)



{
	"name": "BadRequestError",
	"message": "BadRequestError(400, 'action_request_validation_exception', 'Validation Failed: 1: [rank] cannot be used with [highlighter];')",
	"stack": "---------------------------------------------------------------------------
BadRequestError                           Traceback (most recent call last)
Cell In[11], line 41
     36     search_results = es.search(index=index, body=search_query)
     38     return search_results[\"hits\"][\"hits\"]
---> 41 res = search_es_with_highlighting_and_slides(
     42     \"Find examples of 'digital maturity model' and how to define one\", k=5
     43 )
     44 print(res)

Cell In[11], line 36, in search_es_with_highlighting_and_slides(query, k, es, index, _source)
      5 def search_es_with_highlighting_and_slides(
      6     query,
      7     k=5,
   (...)
     11 ):
     12     # Define search query with highlighting
     13     search_query = {
     14         \"query\": {
     15             \"match\": {\"alltexts\": {\"query\": query, \"minimum_should_match\": \"75%\"}}
   (...)
     33         \"size\": k,
     34     }
---> 36     search_results = es.search(index=index, body=search_query)
     38     return search_results[\"hits\"][\"hits\"]

File ~/.conda/envs/unstructured/lib/python3.11/site-packages/elasticsearch/_sync/client/utils.py:426, in _rewrite_parameters.<locals>.wrapper.<locals>.wrapped(*args, **kwargs)
    423         except KeyError:
    424             pass
--> 426 return api(*args, **kwargs)

File ~/.conda/envs/unstructured/lib/python3.11/site-packages/elasticsearch/_sync/client/__init__.py:3836, in Elasticsearch.search(self, index, aggregations, aggs, allow_no_indices, allow_partial_search_results, analyze_wildcard, analyzer, batched_reduce_size, ccs_minimize_roundtrips, collapse, default_operator, df, docvalue_fields, error_trace, expand_wildcards, explain, ext, fields, filter_path, from_, highlight, human, ignore_throttled, ignore_unavailable, indices_boost, knn, lenient, max_concurrent_shard_requests, min_compatible_shard_node, min_score, pit, post_filter, pre_filter_shard_size, preference, pretty, profile, q, query, rank, request_cache, rescore, rest_total_hits_as_int, routing, runtime_mappings, script_fields, scroll, search_after, search_type, seq_no_primary_term, size, slice, sort, source, source_excludes, source_includes, stats, stored_fields, suggest, suggest_field, suggest_mode, suggest_size, suggest_text, terminate_after, timeout, track_scores, track_total_hits, typed_keys, version, body)
   3834 if __body is not None:
   3835     __headers[\"content-type\"] = \"application/json\"
-> 3836 return self.perform_request(  # type: ignore[return-value]
   3837     \"POST\", __path, params=__query, headers=__headers, body=__body
   3838 )

File ~/.conda/envs/unstructured/lib/python3.11/site-packages/elasticsearch/_sync/client/_base.py:320, in BaseClient.perform_request(self, method, path, params, headers, body)
    317         except (ValueError, KeyError, TypeError):
    318             pass
--> 320     raise HTTP_EXCEPTIONS.get(meta.status, ApiError)(
    321         message=message, meta=meta, body=resp_body
    322     )
    324 # 'X-Elastic-Product: Elasticsearch' should be on every 2XX response.
    325 if not self._verified_elasticsearch:
    326     # If the header is set we mark the server as verified.

BadRequestError: BadRequestError(400, 'action_request_validation_exception', 'Validation Failed: 1: [rank] cannot be used with [highlighter];')"
}

I removed rank and it worked, but I have two issues:
1. I need semantic highlighting for kNN, because keyword search does not do well in many cases for my use case.
2. How does hybrid search work without rank? Is any RRF happening?

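  1. I'm not sure how you could do highlighting on text with vectors. There's basically no connection between the source terms and the generated vectors.
  2. No RRF happens without rank. The ranking is based on the score only: the kNN score and the query score are added together, each weighted by its boost.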
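
For reference, here is a minimal sketch of the request shape that works once `rank` is removed, reusing the field names from the code above (`alltexts`, `embedding`). The helper names `build_hybrid_highlight_body` and `collect_highlights` are hypothetical, and `num_candidates=100` is an illustrative value, not a recommendation from this thread. Hits that match only through the kNN clause should come back without a `highlight` entry, so the second helper treats that key as optional.

```python
def build_hybrid_highlight_body(query_text, query_vector, size=5):
    """Hybrid request without `rank`; highlighting applies to the text query only."""
    return {
        "query": {
            "match": {"alltexts": {"query": query_text, "minimum_should_match": "75%"}}
        },
        "knn": {
            "field": "embedding",
            "query_vector": query_vector,
            "k": 10,
            "num_candidates": 100,  # illustrative value (assumption)
        },
        "highlight": {
            "fields": {"alltexts": {"fragment_size": 100, "number_of_fragments": 10}},
            "pre_tags": ["<mark>"],
            "post_tags": ["</mark>"],
        },
        "size": size,
    }


def collect_highlights(hits, field="alltexts"):
    """Map each hit's _id to its highlight fragments (empty list if there are none)."""
    return {hit["_id"]: hit.get("highlight", {}).get(field, []) for hit in hits}


# Example with a made-up vector and made-up hits (no cluster needed).
body = build_hybrid_highlight_body("digital maturity model", [0.1, 0.2, 0.3])
fake_hits = [
    {"_id": "1", "_score": 1.7, "highlight": {"alltexts": ["a <mark>digital</mark> model"]}},
    {"_id": "2", "_score": 0.9},  # matched by kNN only: no "highlight" key
]
print(collect_highlights(fake_hits))
```

Against a live cluster the body would be sent the same way as in the original function, with `es.search(index=index, body=body)`.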

I guess you could use a sampler aggregation on the top N results and use the "significant_text" aggregation on the text field. That could draw out some of the keywords strongly related to the vector. These keywords could then be used in a follow-up query to get highlighting.
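
A rough sketch of that two-step idea, under several assumptions: the aggregation names `sample` and `keywords`, the `shard_size` of 100, and all helper names are made up for illustration. Restricting the follow-up query to the first-pass hit IDs with an `ids` filter is one possible design choice, not something suggested in this thread.

```python
def build_keyword_discovery_body(query_vector, text_field="alltexts", shard_size=100):
    """Step 1: kNN search plus sampler + significant_text over the top-scoring hits."""
    return {
        "knn": {
            "field": "embedding",
            "query_vector": query_vector,
            "k": 10,
            "num_candidates": 100,  # illustrative value (assumption)
        },
        "aggs": {
            "sample": {
                "sampler": {"shard_size": shard_size},
                "aggs": {"keywords": {"significant_text": {"field": text_field}}},
            }
        },
        "size": 10,
    }


def extract_keywords(response):
    """Pull the significant terms out of the step 1 response."""
    buckets = response["aggregations"]["sample"]["keywords"]["buckets"]
    return [bucket["key"] for bucket in buckets]


def build_highlight_followup_body(keywords, doc_ids, text_field="alltexts"):
    """Step 2: keep the same documents (ids filter) and highlight the keywords."""
    return {
        "query": {
            "bool": {
                "filter": [{"ids": {"values": doc_ids}}],
                # Optional clause: it drives highlighting and does not exclude any ID.
                "should": [{"match": {text_field: " ".join(keywords)}}],
            }
        },
        "highlight": {"fields": {text_field: {}}},
        "size": len(doc_ids),
    }


# Example with a made-up response (no cluster needed).
fake_response = {
    "hits": {"hits": [{"_id": "a"}, {"_id": "b"}]},
    "aggregations": {
        "sample": {
            "doc_count": 2,
            "keywords": {
                "buckets": [
                    {"key": "refinery", "doc_count": 2, "score": 0.8, "bg_count": 5},
                    {"key": "upstream", "doc_count": 1, "score": 0.4, "bg_count": 3},
                ]
            },
        }
    },
}
keywords = extract_keywords(fake_response)
doc_ids = [hit["_id"] for hit in fake_response["hits"]["hits"]]
followup = build_highlight_followup_body(keywords, doc_ids)
print(keywords, followup["query"]["bool"]["should"])
```

Because the `should` clause sits next to a `filter`, it stays optional, so documents that contain none of the keywords are still returned, just without highlight fragments.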

Nice hack! :wink:

Thanks @Mark_Harwood1. Interesting idea, but it didn't work well. As an example, I tried this query:

query = "Find example case studies for clients like Chevron."

res, keywords = search_es_with_significant_text(query, k=5)

While the result set includes clients like Chevron and Exxon, the significant keywords look like this:

['transitioned', 'dark', 'advertising', 'imagery', 'backend', 'sophisticated', 'showcases']

These are not really what I'd like to see.

Am I missing something?

Some ideas:
Try the filter_duplicate_text setting, which should help avoid being misled by duplicated noise in the results.

If you have many shards and the relevant content is spread thinly across many of them, it becomes much harder to detect any signal in the data.

If your query is a vector with a filter (e.g. category:caseStudies), the significance algorithm may be tuning into the language of case studies in general, as compared with other categories of docs such as category:apiDocs. Use the background_filter setting of significant_text to set the baseline to category:caseStudies so that it has a chance to tune into the Chevron/Exxon-related language rather than the general case-study vernacular.

Indexing with shingles can help identify key phrases like [oil field] and [shale drilling].

Different choices of heuristic can produce different results.

However, all of the above is futile if you only have a handful of truly relevant docs. There just may not be the cohesion, and therefore the signal strength, in the results required to pull anything meaningful or useful out of the content.
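
The tuning ideas above could be sketched roughly as follows. This is an illustration under assumptions, not a tested recipe: the `category` field comes from the example in this post and may not exist in the real index, and the names `build_significant_text`, `shingle_analyzer`, `two_word_shingles` and the `alltexts.shingles` sub-field are hypothetical.

```python
# Heuristic names accepted by the significant terms/text aggregations.
HEURISTICS = {"jlh", "mutual_information", "chi_square", "gnd", "percentage"}


def build_significant_text(field="alltexts", background_filter=None,
                           heuristic="jlh", source_fields=None, size=10):
    """Build a significant_text aggregation with the tuning options discussed above."""
    if heuristic not in HEURISTICS:
        raise ValueError(f"unknown heuristic: {heuristic}")
    agg = {
        "field": field,
        "size": size,
        "filter_duplicate_text": True,  # ignore repeated boilerplate in the sample
        heuristic: {},                  # significance heuristic with default options
    }
    if background_filter is not None:
        # Baseline set to compare against, e.g. all case studies.
        agg["background_filter"] = background_filter
    if source_fields is not None:
        # _source field(s) to re-analyze when they differ from the indexed field name.
        agg["source_fields"] = source_fields
    return {"significant_text": agg}


# Index settings and mapping for a shingled sub-field (names are assumptions).
SHINGLE_INDEX_SETTINGS = {
    "analysis": {
        "filter": {
            "two_word_shingles": {
                "type": "shingle",
                "min_shingle_size": 2,
                "max_shingle_size": 2,
                "output_unigrams": True,
            }
        },
        "analyzer": {
            "shingle_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "two_word_shingles"],
            }
        },
    }
}
SHINGLE_SUBFIELD_MAPPING = {
    "properties": {
        "alltexts": {
            "type": "text",
            "fields": {"shingles": {"type": "text", "analyzer": "shingle_analyzer"}},
        }
    }
}

# Example: case studies as the background set, GND heuristic, shingled sub-field.
tuned = build_significant_text(
    field="alltexts.shingles",
    background_filter={"term": {"category": "caseStudies"}},
    heuristic="gnd",
    source_fields=["alltexts"],
)
print(tuned)
```

The returned dict would take the place of the plain `significant_text` aggregation nested under a `sampler`. Whether `source_fields` is actually needed for a multi-field like `alltexts.shingles` depends on the mapping, so treat that argument as something to verify.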

Thanks very much @Mark_Harwood1.
Will try them.

Any idea how Google highlighting works?

Not something I've studied, sorry.
