We use OCR software to extract the full text from scanned images in ALTO or hOCR format. These formats record the coordinates of each word on the scanned image.
When a user searches the full text, we need to show the scanned image with the matching text highlighted on it, using this coordinate information.
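To illustrate the word-level coordinates, here is a minimal Python sketch that pulls words and their bounding boxes out of hOCR. It relies on the hOCR convention that words are elements with class `ocrx_word` and a `bbox x0 y0 x1 y1` entry in the `title` attribute. The sample markup and the helper name `hocr_words` are made up for illustration and do not come from our real data.

```python
import re
from html.parser import HTMLParser

# hOCR stores the box as "bbox x0 y0 x1 y1" inside the title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collects (text, (x0, y0, x1, y1)) for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # box of the word element currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())

    def handle_data(self, data):
        # The first non-blank text after a word start tag is the word itself.
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None


def hocr_words(hocr):
    """Hypothetical helper: return all words with their bounding boxes."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Made-up sample markup, only to show the shape of the data.
SAMPLE = """<span class='ocr_line' title='bbox 196 1813 1232 1850'>
<span class='ocrx_word' title='bbox 871 1813 1036 1848; x_wconf 95'>Mason</span>
<span class='ocrx_word' title='bbox 1050 1813 1110 1848; x_wconf 96'>and</span>
</span>"""

for text, bbox in hocr_words(SAMPLE):
    print(text, bbox)
```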
For Solr, there is a plugin whose highlighter returns all of this information. An example can be found on the Solr OCR Highlighting Plugin page (dbmdz.github.io).
The following example is taken from that page:
{
  "text": "to those parts, subject to unreasonable claims from the proprietor "
          "of Maryland, until the year 17C2, when the whole controversy was "
          "settled by Charles <em>Mason and Jeremiah Dixon</em>, upon their "
          "return from an observation of the transit of Venus, at the Cape of "
          "Good Hope, where they",
  "score": 5555104.5,
  "pages": [
    { "id": "page_380", "width": 1436, "height": 2427 }
  ],
  "regions": [
    { "ulx": 196, "uly": 1703, "lrx": 1232, "lry": 1968, "pageIdx": 0 }
  ],
  "highlights": [
    [{ "text": "Mason and Jeremiah", "ulx": 675, "uly": 110, "lrx": 1036, "lry": 145,
       "parentRegionIdx": 0 },
     { "text": "Dixon,", "ulx": 1, "uly": 167, "lrx": 119, "lry": 204,
       "parentRegionIdx": 0 }]
  ]
}
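To draw these highlights on the full page image, the boxes presumably have to be shifted by the upper-left corner of their parent region. Below is a minimal Python sketch of that step. It assumes the highlight coordinates are relative to the region referenced by `parentRegionIdx`, which the numbers in the example are consistent with. `to_page_coordinates` is a hypothetical helper name, not part of the plugin.

```python
def to_page_coordinates(snippet):
    """Convert region-relative highlight boxes to page coordinates.

    Assumption: each highlight box is relative to the upper-left corner
    of the region named by its "parentRegionIdx".
    Returns one list of boxes per hit, mirroring snippet["highlights"].
    """
    hits = []
    for hit in snippet["highlights"]:
        boxes = []
        for box in hit:
            region = snippet["regions"][box["parentRegionIdx"]]
            page = snippet["pages"][region["pageIdx"]]
            boxes.append({
                "page": page["id"],
                "text": box["text"],
                "ulx": region["ulx"] + box["ulx"],
                "uly": region["uly"] + box["uly"],
                "lrx": region["ulx"] + box["lrx"],
                "lry": region["uly"] + box["lry"],
            })
        hits.append(boxes)
    return hits


# The coordinate fields from the example above ("text" and "score" omitted).
snippet = {
    "pages": [{"id": "page_380", "width": 1436, "height": 2427}],
    "regions": [
        {"ulx": 196, "uly": 1703, "lrx": 1232, "lry": 1968, "pageIdx": 0}
    ],
    "highlights": [[
        {"text": "Mason and Jeremiah", "ulx": 675, "uly": 110,
         "lrx": 1036, "lry": 145, "parentRegionIdx": 0},
        {"text": "Dixon,", "ulx": 1, "uly": 167,
         "lrx": 119, "lry": 204, "parentRegionIdx": 0},
    ]],
}

absolute = to_page_coordinates(snippet)
for box in absolute[0]:
    print(box["page"], box["text"],
          (box["ulx"], box["uly"], box["lrx"], box["lry"]))
```

Under that assumption, "Mason and Jeremiah" ends up at (871, 1813, 1232, 1848) and "Dixon," at (197, 1870, 315, 1907) on page_380. The right edge of the first box coincides with the right edge of the region, which fits a phrase that wraps onto the next line.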
Does anybody know whether a similar plugin exists for Elasticsearch? If not, how can the same result be achieved in Elasticsearch?
Thanks
J