Highlights not matching between annotated_text and other fields

When I run a query_string search on an index, the highlights returned for the different fields sometimes don't line up with each other. I need them to line up in order to get the annotated metadata for each hit. From what I've seen there is not much support for this kind of feature, but I guess asking doesn't hurt.

Starting from the result:

"highlight" : {
  "annotatedText" : [
    "[Grazie](976200) [Presidente](_hit_term=annotatedText%3Apresi*&976200) [buongiorno](_hit_term=buongiorno&976200) [a](976200) [tutte](976200) [buongiorno](_hit_term=buongiorno&978720) [a](978720) [tutti.](978720)",
    "[Grazie](3357160) [Presidente](_hit_term=annotatedText%3Apresi*&3357160) [e](3357160) [buongiorno](_hit_term=buongiorno&3357160) [colleghi.](3357160)",
    "[Grazie](3893600) [Presidente](_hit_term=annotatedText%3Apresi*&3893600) [buongiorno](_hit_term=buongiorno&3893600) [all'aula](3896880) [e](3896880)",
    "[Grazie](4248920) [presidente](_hit_term=annotatedText%3Apresi*&4248920) [e](4248920) [un](4248920) [buongiorno](_hit_term=buongiorno&4248920) [colleghe](4248920) [e](4248920)",
    "[Grazie](6168080) [presidente](_hit_term=annotatedText%3Apresi*). [Buongiorno](_hit_term=buongiorno&6169280) [a](6169280) [tutti.](6169280)"
  ],
  "text" : [
    "<em>Buongiorno</em> <em>buongiorno</em> a tutti possiamo iniziare chiedo al consigliere segretario Jordan di procedere all'appello. ",
    "Grazie Presidente <em>buongiorno</em> a tutte <em>buongiorno</em> a tutti.",
    "Grazie Presidente e <em>buongiorno</em> colleghi. Bisogna evidenziare che in questi ultimi anni la situazione",
    "Consigliere Cretier ne ha facoltà. Grazie Presidente <em>buongiorno</em> all'aula e ai colleghi consiglieri e alla giunta compresa. ",
    "Grazie presidente e un <em>buongiorno</em> colleghe e colleghi. "
  ]
}

The first annotatedText highlight matches the second text highlight, the second annotatedText highlight matches the third text highlight, and so on, until the fifth, which doesn't match any text highlight.
Some other results line up correctly, while others diverge at a different point.
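To make the mismatch concrete, here is a minimal sketch that strips the markup from both kinds of snippet and reports which text highlight, if any, contains each annotatedText highlight. The helper names (strip_annotations, strip_em, align) are made up for illustration, and the regular expression only assumes the [text](value) shape visible in the response above.

```python
import re
from typing import List, Optional

# Shape seen in the response above: [visible text](annotation value)
ANNOTATION = re.compile(r"\[([^\]]*)\]\(([^)]*)\)")
EM_TAG = re.compile(r"</?em>")


def strip_annotations(snippet: str) -> str:
    """Keep only the visible text of an annotated_text snippet."""
    return ANNOTATION.sub(r"\1", snippet).strip()


def strip_em(snippet: str) -> str:
    """Remove the <em> tags added by the highlighter on the plain text field."""
    return EM_TAG.sub("", snippet).strip()


def align(annotated: List[str], plain: List[str]) -> List[Optional[int]]:
    """For each annotated snippet, return the index of the first plain
    snippet that contains its text, or None when nothing contains it."""
    plain_clean = [strip_em(p) for p in plain]
    result: List[Optional[int]] = []
    for snippet in annotated:
        needle = strip_annotations(snippet)
        result.append(
            next((i for i, p in enumerate(plain_clean) if needle in p), None)
        )
    return result


annotated = [
    "[Grazie](976200) [Presidente](_hit_term=annotatedText%3Apresi*&976200) "
    "[buongiorno](_hit_term=buongiorno&976200) [a](976200) [tutte](976200) "
    "[buongiorno](_hit_term=buongiorno&978720) [a](978720) [tutti.](978720)",
    "[Grazie](6168080) [presidente](_hit_term=annotatedText%3Apresi*). "
    "[Buongiorno](_hit_term=buongiorno&6169280) [a](6169280) [tutti.](6169280)",
]
plain = [
    "<em>Buongiorno</em> <em>buongiorno</em> a tutti possiamo iniziare",
    "Grazie Presidente <em>buongiorno</em> a tutte <em>buongiorno</em> a tutti.",
]
alignment = align(annotated, plain)
```

With these two annotatedText snippets and two text snippets from the result, the first annotated snippet maps to index 1 and the last one maps to nothing, which is the behaviour described above.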

The POST _search is like this:

{
  "query": {
    "query_string": {
      "query": "presi* buongiorno",
      "fields": ["textItalian","text","annotatedText"],
      "default_operator": "AND",
      "type": "phrase",
      "analyze_wildcard": true
    }
  },
  "_source": {
    "excludes": ["text","annotatedText","textItalian"]
  },
  "highlight": {
    "fields": {
      "text": {},
      "annotatedText": { "type":"annotated", "matched_fields": ["text","textItalian"] }
    },
    "type": "fvh",
    "boundary_scanner": "sentence"
  },
  "sort": [{"referenceDate": { "order": "asc" }}]
}

And the index is like this:

{
  "mappings" : {
    "properties" : {
      "$type" : {
        "type" : "text"
      },
      "annotatedText" : {
        "type" : "annotated_text",
        "term_vector" : "with_positions_offsets"
      },
      "contentId" : {
        "type" : "long"
      },
      "id" : {
        "type" : "text",
        "fields" : {
          "keyword" : { "type" : "keyword", "ignore_above" : 256 }
        }
      },
      "inserted" : {
        "type" : "date"
      },
      "language" : {
        "type" : "text",
        "fields" : {
          "keyword" : { "type" : "keyword", "ignore_above" : 256 }
        }
      },
      "referenceDate" : {
        "type" : "date"
      },
      "text" : {
        "type" : "text",
        "term_vector" : "with_positions_offsets",
        "index_prefixes" : { "min_chars" : 2, "max_chars" : 5 }
      },
      "textItalian" : {
        "type" : "text",
        "term_vector" : "with_positions_offsets",
        "analyzer" : "italian",
        "index_prefixes" : { "min_chars" : 2, "max_chars" : 5 }
      },
      "trackId" : {
        "type" : "long"
      }
    }
  }
}

EDIT
Apparently "require_field_match":false helps a little.
I'm using Elasticsearch 7.14.
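For reference, a small sketch of where that setting goes, written as a Python dict that mirrors the request above. The helper name with_require_field_match is hypothetical, and sending the body to a cluster is left out on purpose. require_field_match defaults to true, which restricts highlighting to fields that contain a query match.

```python
import copy


def with_require_field_match(body: dict, value: bool) -> dict:
    """Return a copy of a search body with highlight.require_field_match set.

    The original body is left untouched so both variants can be compared.
    """
    updated = copy.deepcopy(body)
    updated.setdefault("highlight", {})["require_field_match"] = value
    return updated


# Same query and highlight section as the request in this post.
search_body = {
    "query": {
        "query_string": {
            "query": "presi* buongiorno",
            "fields": ["textItalian", "text", "annotatedText"],
            "default_operator": "AND",
            "type": "phrase",
            "analyze_wildcard": True,
        }
    },
    "highlight": {
        "fields": {
            "text": {},
            "annotatedText": {
                "type": "annotated",
                "matched_fields": ["text", "textItalian"],
            },
        },
        "type": "fvh",
        "boundary_scanner": "sentence",
    },
}

relaxed = with_require_field_match(search_body, False)
```

Comparing the highlights returned for search_body and relaxed on the same documents is one way to see how much the setting actually helps.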

Hi evoluc

I'm not sure I understand the problem. All 5 highlighted snippets include a hit on the term presidente, so I don't follow the comment "the fifth doesn't really match anything".

All the fields contain the same text, so by using matched_fields I expected the highlights in annotatedText to correspond to the highlights of the other field(s).

The text and annotated_text fields in your request use different highlighter implementations, which behave differently. I see, for example, that the plain text field has not highlighted any of the presi* search terms.
The truth is that there are several highlighter implementations, written by various authors over the years, each of which set out to improve on the others. Each makes its own trade-offs and behaves with subtle differences.

Oh, yeah, the specific type:annotated on the field overrides the global type:fvh, so matched_fields is ignored. I was oblivious to that until now.
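The override can be checked mechanically before sending a request. The sketch below resolves the highlighter each field ends up with (field-level settings win over the global ones) and lists fields whose matched_fields would go unused. The function names are hypothetical, and the rule that only fvh honours matched_fields is an assumption based on the 7.x behaviour discussed in this thread.

```python
def effective_type(highlight: dict, field: str, default: str = "unified") -> str:
    """Highlighter type a field ends up with: the field-level "type" wins,
    then the global "type", then the default highlighter."""
    field_opts = highlight.get("fields", {}).get(field, {})
    return field_opts.get("type", highlight.get("type", default))


def ignored_matched_fields(highlight: dict) -> list:
    """Fields that set matched_fields but do not resolve to fvh.

    Assumption: only the fvh highlighter uses matched_fields (7.x behaviour).
    """
    return [
        name
        for name, opts in highlight.get("fields", {}).items()
        if "matched_fields" in opts and effective_type(highlight, name) != "fvh"
    ]


# Highlight section from the request in the first post.
highlight = {
    "fields": {
        "text": {},
        "annotatedText": {
            "type": "annotated",
            "matched_fields": ["text", "textItalian"],
        },
    },
    "type": "fvh",
    "boundary_scanner": "sentence",
}

types = {name: effective_type(highlight, name) for name in highlight["fields"]}
ignored = ignored_matched_fields(highlight)
```

For the request in this thread, text resolves to fvh while annotatedText resolves to annotated, so annotatedText is reported as a field whose matched_fields has no effect.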
I ended up here because it's not possible to extract the offset of a highlight, so I tried to run a query and then match the highlight internally against another field.
I don't know, maybe I'll try another way, like querying annotatedText twice, once with the italian analyzer and once with the standard analyzer.
Thanks anyway.
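Whichever route is taken, the annotatedText snippets already carry the annotation values next to each hit, so they can be read directly from the highlight. A minimal sketch follows, assuming the [text](a&b) shape shown in the response, where a part starting with _hit_term= marks a hit and any other part is an annotation value. The helper names are made up, and what the numeric values mean is specific to the poster's data.

```python
import re
from urllib.parse import unquote

# Shape seen in the response: [visible text](part&part&...)
ANNOTATION = re.compile(r"\[([^\]]*)\]\(([^)]*)\)")
HIT_PREFIX = "_hit_term="


def parse_snippet(snippet: str) -> list:
    """Split an annotated highlight into tokens with their hit terms
    and remaining annotation values (URL-decoded)."""
    tokens = []
    for text, raw in ANNOTATION.findall(snippet):
        hit_terms, values = [], []
        for part in raw.split("&"):
            if part.startswith(HIT_PREFIX):
                hit_terms.append(unquote(part[len(HIT_PREFIX):]))
            elif part:
                values.append(unquote(part))
        tokens.append({"text": text, "hit_terms": hit_terms, "values": values})
    return tokens


def hits(snippet: str) -> list:
    """Only the tokens that the highlighter marked as hits."""
    return [t for t in parse_snippet(snippet) if t["hit_terms"]]


snippet = (
    "[Grazie](976200) "
    "[Presidente](_hit_term=annotatedText%3Apresi*&976200) "
    "[buongiorno](_hit_term=buongiorno&976200)"
)
found = hits(snippet)
```

Each entry in found pairs the highlighted word with the query term that matched it and the annotation value attached to that word, without needing a second field to cross-reference.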
