When I run a query_string search on an index, the highlights returned for the different fields sometimes don't line up with each other (I need them to line up so I can read the annotated metadata). From what I've seen there is not much support for this kind of feature, but I guess asking doesn't hurt.
Take this result as an example:
"highlight" : {
  "annotatedText" : [
    "[Grazie](976200) [Presidente](_hit_term=annotatedText%3Apresi*&976200) [buongiorno](_hit_term=buongiorno&976200) [a](976200) [tutte](976200) [buongiorno](_hit_term=buongiorno&978720) [a](978720) [tutti.](978720)",
    "[Grazie](3357160) [Presidente](_hit_term=annotatedText%3Apresi*&3357160) [e](3357160) [buongiorno](_hit_term=buongiorno&3357160) [colleghi.](3357160)",
    "[Grazie](3893600) [Presidente](_hit_term=annotatedText%3Apresi*&3893600) [buongiorno](_hit_term=buongiorno&3893600) [all'aula](3896880) [e](3896880)",
    "[Grazie](4248920) [presidente](_hit_term=annotatedText%3Apresi*&4248920) [e](4248920) [un](4248920) [buongiorno](_hit_term=buongiorno&4248920) [colleghe](4248920) [e](4248920)",
    "[Grazie](6168080) [presidente](_hit_term=annotatedText%3Apresi*). [Buongiorno](_hit_term=buongiorno&6169280) [a](6169280) [tutti.](6169280)"
  ],
  "text" : [
    "<em>Buongiorno</em> <em>buongiorno</em> a tutti possiamo iniziare chiedo al consigliere segretario Jordan di procedere all'appello. ",
    "Grazie Presidente <em>buongiorno</em> a tutte <em>buongiorno</em> a tutti.",
    "Grazie Presidente e <em>buongiorno</em> colleghi. Bisogna evidenziare che in questi ultimi anni la situazione",
    "Consigliere Cretier ne ha facoltà. Grazie Presidente <em>buongiorno</em> all'aula e ai colleghi consiglieri e alla giunta compresa. ",
    "Grazie presidente e un <em>buongiorno</em> colleghe e colleghi. "
  ]
}
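The first annotatedText highlight corresponds to the second text highlight, the second annotatedText highlight to the third text highlight, and so on, until the fifth annotatedText highlight, which doesn't correspond to any text highlight at all. The first text highlight, in turn, has no annotatedText counterpart.
For some hits the two lists line up correctly; for others they diverge at a different position.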
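To make the mismatch concrete, here is a minimal Python sketch that strips the markup from both highlight lists and reports, for each annotatedText fragment, which text fragment contains it. The helper names (`strip_annotated`, `strip_em`, `align`) are hypothetical, and this is only a client-side way to inspect the offset, not something Elasticsearch provides.

```python
import re

def strip_annotated(fragment: str) -> str:
    """Turn '[word](annotation)' markup back into plain 'word' text."""
    return re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", fragment)

def strip_em(fragment: str) -> str:
    """Remove the <em>...</em> highlight tags."""
    return re.sub(r"</?em>", "", fragment)

def align(annotated, text):
    """For each annotatedText fragment, return the index of the first
    text fragment that contains it as plain text, or None."""
    plain_text = [strip_em(t) for t in text]
    result = []
    for fragment in annotated:
        plain = strip_annotated(fragment).strip()
        match = next((i for i, t in enumerate(plain_text) if plain in t), None)
        result.append(match)
    return result

# First and fifth annotatedText fragments, first two text fragments from the result above.
annotated = [
    "[Grazie](976200) [Presidente](_hit_term=annotatedText%3Apresi*&976200) [buongiorno](_hit_term=buongiorno&976200) [a](976200) [tutte](976200) [buongiorno](_hit_term=buongiorno&978720) [a](978720) [tutti.](978720)",
    "[Grazie](6168080) [presidente](_hit_term=annotatedText%3Apresi*). [Buongiorno](_hit_term=buongiorno&6169280) [a](6169280) [tutti.](6169280)",
]
text = [
    "<em>Buongiorno</em> <em>buongiorno</em> a tutti possiamo iniziare chiedo al consigliere segretario Jordan di procedere all'appello. ",
    "Grazie Presidente <em>buongiorno</em> a tutte <em>buongiorno</em> a tutti.",
]
print(align(annotated, text))  # [1, None]
```

Run against the full lists, this gives index 1, 2, 3, 4 for the first four annotatedText fragments and no match for the fifth, which is the off-by-one shift described above.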
The body of the POST _search request looks like this:
{
  "query": {
    "query_string": {
      "query": "presi* buongiorno",
      "fields": ["textItalian", "text", "annotatedText"],
      "default_operator": "AND",
      "type": "phrase",
      "analyze_wildcard": true
    }
  },
  "_source": {
    "excludes": ["text", "annotatedText", "textItalian"]
  },
  "highlight": {
    "fields": {
      "text": {},
      "annotatedText": { "type": "annotated", "matched_fields": ["text", "textItalian"] }
    },
    "type": "fvh",
    "boundary_scanner": "sentence"
  },
  "sort": [{ "referenceDate": { "order": "asc" } }]
}
And the index mapping looks like this:
{
  "mappings" : {
    "properties" : {
      "$type" : {
        "type" : "text"
      },
      "annotatedText" : {
        "type" : "annotated_text",
        "term_vector" : "with_positions_offsets"
      },
      "contentId" : {
        "type" : "long"
      },
      "id" : {
        "type" : "text",
        "fields" : {
          "keyword" : { "type" : "keyword", "ignore_above" : 256 }
        }
      },
      "inserted" : {
        "type" : "date"
      },
      "language" : {
        "type" : "text",
        "fields" : {
          "keyword" : { "type" : "keyword", "ignore_above" : 256 }
        }
      },
      "referenceDate" : {
        "type" : "date"
      },
      "text" : {
        "type" : "text",
        "term_vector" : "with_positions_offsets",
        "index_prefixes" : { "min_chars" : 2, "max_chars" : 5 }
      },
      "textItalian" : {
        "type" : "text",
        "term_vector" : "with_positions_offsets",
        "analyzer" : "italian",
        "index_prefixes" : { "min_chars" : 2, "max_chars" : 5 }
      },
      "trackId" : {
        "type" : "long"
      }
    }
  }
}
EDIT
Apparently setting "require_field_match": false on the highlighter helps a little.
I'm using Elasticsearch 7.14.
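For reference, a small Python sketch of the highlight section with that flag added. It assumes the flag is set once for the whole highlight section; it can also be set per field, and which placement was used for the test above is not specified here.

```python
import json

# Same highlight section as in the request above, plus require_field_match.
# Assumption: the flag is set at the top level of "highlight", not per field.
highlight = {
    "require_field_match": False,
    "type": "fvh",
    "boundary_scanner": "sentence",
    "fields": {
        "text": {},
        "annotatedText": {
            "type": "annotated",
            "matched_fields": ["text", "textItalian"],
        },
    },
}

# Print the fragment as JSON, ready to paste into the request body.
print(json.dumps({"highlight": highlight}, indent=2))
```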