Highlights not matching between annotated_text and other fields

evoluc · November 23, 2021, 10:20am

When I run a query_string search on an index, the highlights returned for the different fields sometimes don't correspond to each other (I need them to line up in order to retrieve the annotated metadata). From what I've seen there isn't much support for this kind of feature, but I guess asking doesn't hurt.

Starting from the result:

"highlight" : {
  "annotatedText" : [
    "[Grazie](976200) [Presidente](_hit_term=annotatedText%3Apresi*&976200) [buongiorno](_hit_term=buongiorno&976200) [a](976200) [tutte](976200) [buongiorno](_hit_term=buongiorno&978720) [a](978720) [tutti.](978720)",
    "[Grazie](3357160) [Presidente](_hit_term=annotatedText%3Apresi*&3357160) [e](3357160) [buongiorno](_hit_term=buongiorno&3357160) [colleghi.](3357160)",
    "[Grazie](3893600) [Presidente](_hit_term=annotatedText%3Apresi*&3893600) [buongiorno](_hit_term=buongiorno&3893600) [all'aula](3896880) [e](3896880)",
    "[Grazie](4248920) [presidente](_hit_term=annotatedText%3Apresi*&4248920) [e](4248920) [un](4248920) [buongiorno](_hit_term=buongiorno&4248920) [colleghe](4248920) [e](4248920)",
    "[Grazie](6168080) [presidente](_hit_term=annotatedText%3Apresi*). [Buongiorno](_hit_term=buongiorno&6169280) [a](6169280) [tutti.](6169280)"
  ],
  "text" : [
    "<em>Buongiorno</em> <em>buongiorno</em> a tutti possiamo iniziare chiedo al consigliere segretario Jordan di procedere all'appello. ",
    "Grazie Presidente <em>buongiorno</em> a tutte <em>buongiorno</em> a tutti.",
    "Grazie Presidente e <em>buongiorno</em> colleghi. Bisogna evidenziare che in questi ultimi anni la situazione",
    "Consigliere Cretier ne ha facoltà. Grazie Presidente <em>buongiorno</em> all'aula e ai colleghi consiglieri e alla giunta compresa. ",
    "Grazie presidente e un <em>buongiorno</em> colleghe e colleghi. "
  ]
}

The first annotatedText highlight corresponds to the second text highlight, the second annotatedText highlight to the third text highlight, and so on, until the fifth, which doesn't correspond to any text highlight.
For other hits the highlights line up correctly, and for others they diverge at some other point.
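
One way to check this correspondence programmatically is to strip the markup from both kinds of snippet and compare what is left. Below is a minimal Python sketch under that assumption. The helper names (plain_from_annotated, plain_from_em, pair_snippets) are hypothetical, the regular expression only covers the [text](values) markup visible in the result above, and the sample data is copied from two of the snippets shown.

```python
import re

# One annotated token as it appears in the highlight: [visible text](values)
ANNOTATION = re.compile(r"\[([^\]]*)\]\(([^)]*)\)")
EM_TAG = re.compile(r"</?em>")


def plain_from_annotated(snippet):
    """Keep the visible text of each token and drop the (values) part."""
    return ANNOTATION.sub(r"\1", snippet).strip()


def plain_from_em(snippet):
    """Remove the <em> tags produced by the highlighter on the text field."""
    return EM_TAG.sub("", snippet).strip()


def pair_snippets(annotated, text):
    """For each annotated snippet, return the index of the first text
    snippet whose plain text contains it, or None if there is none."""
    plain_text = [plain_from_em(t) for t in text]
    pairs = []
    for snippet in annotated:
        needle = plain_from_annotated(snippet)
        match = next((i for i, t in enumerate(plain_text) if needle in t), None)
        pairs.append(match)
    return pairs


# Sample data copied from the result above (second and fifth annotated snippets).
annotated = [
    "[Grazie](3357160) [Presidente](_hit_term=annotatedText%3Apresi*&3357160) "
    "[e](3357160) [buongiorno](_hit_term=buongiorno&3357160) [colleghi.](3357160)",
    "[Grazie](6168080) [presidente](_hit_term=annotatedText%3Apresi*). "
    "[Buongiorno](_hit_term=buongiorno&6169280) [a](6169280) [tutti.](6169280)",
]
text = [
    "<em>Buongiorno</em> <em>buongiorno</em> a tutti possiamo iniziare chiedo al "
    "consigliere segretario Jordan di procedere all'appello. ",
    "Grazie Presidente e <em>buongiorno</em> colleghi. Bisogna evidenziare che in "
    "questi ultimi anni la situazione",
]

print(pair_snippets(annotated, text))  # [1, None]
```

The comparison is a case-sensitive substring test, so it only detects snippets that cover the same stretch of text; it says nothing about why the two highlighters chose different fragments.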

The POST _search is like this:

{
  "query": {
    "query_string": {
      "query": "presi* buongiorno",
      "fields": ["textItalian","text","annotatedText"],
      "default_operator": "AND",
      "type": "phrase",
      "analyze_wildcard": true
    }
  },
  "_source": {
    "excludes": ["text","annotatedText","textItalian"]
  },
  "highlight": {
    "fields": {
      "text": {},
      "annotatedText": { "type":"annotated", "matched_fields": ["text","textItalian"] }
    },
    "type": "fvh",
    "boundary_scanner": "sentence"
  },
  "sort": [{ "referenceDate": { "order": "asc" } }]
}

And the index is like this:

{
  "mappings" : {
    "properties" : {
      "$type" : {
        "type" : "text"
      },
      "annotatedText" : {
        "type" : "annotated_text",
        "term_vector" : "with_positions_offsets"
      },
      "contentId" : {
        "type" : "long"
      },
      "id" : {
        "type" : "text",
        "fields" : {
          "keyword" : { "type" : "keyword", "ignore_above" : 256 }
        }
      },
      "inserted" : {
        "type" : "date"
      },
      "language" : {
        "type" : "text",
        "fields" : {
          "keyword" : { "type" : "keyword", "ignore_above" : 256 }
        }
      },
      "referenceDate" : {
        "type" : "date"
      },
      "text" : {
        "type" : "text",
        "term_vector" : "with_positions_offsets",
        "index_prefixes" : { "min_chars" : 2, "max_chars" : 5 }
      },
      "textItalian" : {
        "type" : "text",
        "term_vector" : "with_positions_offsets",
        "analyzer" : "italian",
        "index_prefixes" : { "min_chars" : 2, "max_chars" : 5 }
      },
      "trackId" : {
        "type" : "long"
      }
    }
  }
}

EDIT
Apparently "require_field_match":false helps a little.
I'm using Elasticsearch 7.14.

Mark_Harwood · November 23, 2021, 11:27am

Hi evoluc

I'm not sure I understand the problem. All five highlighted snippets include a hit on the term presidente, so I don't understand the comment "the fifth doesn't really match anything".

evoluc · November 23, 2021, 11:39am

All the fields contain the same text, so by using matched_fields I expected the highlights in annotatedText to correspond to the highlights of the other field(s).

Mark_Harwood · November 23, 2021, 11:45am

The text field and the annotated_text field in your request use different highlighter implementations, which behave differently. I see, for example, that the plain text field has not highlighted any of the presi* search terms.
The truth is that there are various highlighter implementations, written by different authors over the years, each of whom set out to improve on the others. Each implementation makes its own trade-offs and behaves in subtly different ways.

evoluc · November 23, 2021, 3:26pm

Oh, yeah: the field-level type:annotated overrides the global type:fvh, so matched_fields is ignored. I had overlooked that, but I see it now.
I ended up with this setup because it's not possible to extract the offset of a highlight directly, so I tried to run a query and then match the offsets against another field myself.
I'll probably try another approach, such as querying annotatedText twice, once with the italian analyzer and once with the standard analyzer.
Thanks anyway.
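
For readers who want to pull the annotation values out of the annotatedText highlight itself, here is a minimal Python sketch. It assumes the format seen in the snippets earlier in this thread: each token is written as [text](values), the values are separated by & and URL-encoded, and a value starting with _hit_term= marks a query hit. The function names are hypothetical, and treating the remaining values (such as 6169280) as the metadata of interest is an assumption based on the thread.

```python
import re
from urllib.parse import unquote

# One annotated token as it appears in the highlight: [visible text](values)
TOKEN = re.compile(r"\[([^\]]*)\]\(([^)]*)\)")
HIT_PREFIX = "_hit_term="


def parse_annotated_snippet(snippet):
    """Split a highlighted annotated_text snippet into tokens, separating
    the hit markers from the other annotation values."""
    tokens = []
    for match in TOKEN.finditer(snippet):
        text, raw_values = match.group(1), match.group(2)
        hit_terms, annotations = [], []
        for value in raw_values.split("&"):
            value = unquote(value)  # e.g. %3A becomes ":"
            if value.startswith(HIT_PREFIX):
                hit_terms.append(value[len(HIT_PREFIX):])
            elif value:
                annotations.append(value)
        tokens.append(
            {"text": text, "hit_terms": hit_terms, "annotations": annotations}
        )
    return tokens


def hits_with_annotations(snippet):
    """Return (visible text, annotation values) for the tokens marked as hits."""
    return [
        (token["text"], token["annotations"])
        for token in parse_annotated_snippet(snippet)
        if token["hit_terms"]
    ]


# Fifth annotated snippet from the result at the top of the thread.
snippet = (
    "[Grazie](6168080) [presidente](_hit_term=annotatedText%3Apresi*). "
    "[Buongiorno](_hit_term=buongiorno&6169280) [a](6169280) [tutti.](6169280)"
)

print(hits_with_annotations(snippet))
# [('presidente', []), ('Buongiorno', ['6169280'])]
```

Note that in this snippet the hit on presidente carries no annotation value at all, so any downstream code has to tolerate hits with an empty list.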

system · December 21, 2021, 3:26pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
