We index the OCRed text of PDFs in Elasticsearch to make it searchable.
We also store the original PDFs: we pass the terms highlighted by Elasticsearch in the URL, and a custom library highlights those words in the PDF.
To support more advanced queries (proximity search), however, we would need the offsets (positions) of the highlighted words directly from Elasticsearch.
With such a query, not every occurrence of the phrase's terms is highlighted, only those that fulfill the distance condition:
Example:
# index & document creation
PUT dominik_test_search/_doc/testing
{
  "content": {
    "DOC_TEXT":""" Property damage covered under this insurance shall mean physical damage to the substance of property.
Physical damage to the substance of property shall not include corruption to data or software, in particular any
detrimental change in data, software or computer programs that is caused by a deletion, a corruption or a
deformation of the original structure.
Property bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla damage bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla include bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla bla data"""
  }
}
# proximity query
GET dominik_test_search/_search
{
  "query": {
    "query_string": {
      "default_field": "content.DOC_TEXT",
      "query": "\"property damage include data\"~10"
    }
  },
  "highlight": {
    "fields": {
      "content.DOC_TEXT": {
        "highlight_query": {
          "query_string": {
            "fields": [
              "content.DOC_TEXT"
            ],
            "query": "\"property damage include data\"~10"
          }
        },
        "type": "unified",
        "boundary_scanner": "sentence",
        "fragment_size": 1000,
        "number_of_fragments": 1,
        "no_match_size": 1000,
        "fragmenter": "span"
      }
    }
  }
}
Result:
Question:
Can Elasticsearch provide, by any means, the offsets of the highlighted words?
(we are using ES 7.16.2 in our clusters)
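To illustrate the kind of output we are after, here is a minimal client-side sketch that derives character offsets from a highlighted string by stripping the highlight tags. It is only a possible fallback, not something Elasticsearch returns. It assumes the default <em>/</em> tags, that the source text never contains those tags literally, and that the highlight covers the entire field (for example with "number_of_fragments": 0); otherwise the offsets are relative to the fragment, not to the document. The function name highlight_offsets is hypothetical.

```python
from typing import List, Tuple


def highlight_offsets(
    highlighted: str, pre_tag: str = "<em>", post_tag: str = "</em>"
) -> List[Tuple[int, int, str]]:
    """Return (start, end, term) for each highlighted term.

    Offsets are character positions in the tag-free text, end exclusive.
    Assumes well-formed, non-nested pre_tag/post_tag pairs.
    """
    offsets = []
    plain_len = 0  # length of the tag-free text consumed so far
    pos = 0        # read position in the tagged string
    while True:
        start = highlighted.find(pre_tag, pos)
        if start == -1:
            break
        end = highlighted.find(post_tag, start + len(pre_tag))
        if end == -1:
            break  # unmatched opening tag: stop rather than guess
        plain_len += start - pos  # untagged text before this term
        term = highlighted[start + len(pre_tag):end]
        offsets.append((plain_len, plain_len + len(term), term))
        plain_len += len(term)
        pos = end + len(post_tag)
    return offsets


# Hypothetical highlight string, shaped like a highlighter response.
sample = "Property <em>damage</em> to <em>data</em>"
print(highlight_offsets(sample))
# [(9, 15, 'damage'), (19, 23, 'data')]
```

These are character offsets into the returned string; mapping them onto the PDF would still be up to the custom library.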
Thanks in advance