OCR highlighter for Alto or hOCR formats

We use OCR software to extract the full text from scanned images in Alto or hOCR format. Both formats record the coordinates of each word on the scanned image.
When a user searches the full text, we are required to show the scanned image with the matching text highlighted on it, using those coordinates.
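
To make "coordinates of each word" concrete, here is a minimal sketch of how the word boxes could be read from hOCR with the Python standard library. It assumes the usual hOCR layout, where each word is a `span` with class `ocrx_word` and a `title` attribute containing `bbox x0 y0 x1 y1`. The names `HocrWordParser` and `parse_hocr_words`, the output keys and the sample markup are invented for illustration.

```python
import re
from html.parser import HTMLParser

# hOCR keeps geometry in the title attribute, e.g. title="bbox 100 200 220 240; x_wconf 93"
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class HocrWordParser(HTMLParser):
    """Collect one dict per ocrx_word element: its text plus its bounding box."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._box = None   # box of the word currently being read, or None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "ocrx_word" in (attributes.get("class") or "").split():
            match = BBOX_RE.search(attributes.get("title") or "")
            if match:
                self._box = [int(value) for value in match.groups()]
                self._text = []

    def handle_data(self, data):
        if self._box is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumption: word spans do not contain nested <span> elements.
        if tag == "span" and self._box is not None:
            ulx, uly, lrx, lry = self._box
            self.words.append({"text": "".join(self._text).strip(),
                               "ulx": ulx, "uly": uly, "lrx": lrx, "lry": lry})
            self._box = None


def parse_hocr_words(hocr):
    """Return the words of an hOCR document in document order."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Invented sample markup, for illustration only.
SAMPLE = """
<span class='ocr_line' title='bbox 100 200 400 240'>
  <span class='ocrx_word' title='bbox 100 200 220 240; x_wconf 93'>Mason</span>
  <span class='ocrx_word' title='bbox 240 200 400 240; x_wconf 90'>Dixon,</span>
</span>
"""

for word in parse_hocr_words(SAMPLE):
    print(word)
```

Alto carries the same kind of information in XML attributes on its word elements, so a similar pass with an XML parser should work there, although I have not sketched it here.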

For Solr, there is a plugin that provides a highlighter returning all of this information. An example can be found at Solr OCR Highlighting Plugin (dbmdz.github.io).

The following example is taken from that page:

  "text": "to those parts, subject to unreasonable claims from the pro­prietor "
          "of Maryland, until the year 17C2, when the whole controversy was "
          "settled by Charles <em>Mason and Jeremiah Dixon</em>, upon their "
          "return from an observation of the tran­sit of Venus, at the Cape of "
          "Good Hope, where they",
  "score": 5555104.5,
  "pages": [
    { "id": "page_380", "width":  1436, "height":  2427 }
  ],
  "regions": [
    { "ulx": 196, "uly": 1703, "lrx": 1232, "lry": 1968, "pageIdx": 0 }
  ],
  "highlights": [
    [{ "text": "Mason and Jeremiah", "ulx": 675, "uly": 110, "lrx": 1036, "lry": 145,
       "parentRegionIdx": 0},
     { "text": "Dixon,", "ulx": 1, "uly": 167, "lrx": 119, "lry": 204,
       "parentRegionIdx": 0 }]
  ]
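
To draw these boxes on the page image, the word boxes that carry a `parentRegionIdx` have to be combined with the entry in `regions` they point to. The numbers in the example suggest that the word boxes are relative to the upper-left corner of their parent region (the first box has `lry` 145 while the region starts at `uly` 1703), so the sketch below assumes exactly that. The helper name `to_page_box` is hypothetical and not part of the plugin.

```python
def to_page_box(highlight, regions):
    """Translate a region-relative highlight box into page coordinates.

    Assumption: highlight coordinates are offsets from the upper-left
    corner of the region referenced by parentRegionIdx.
    """
    region = regions[highlight["parentRegionIdx"]]
    return {
        "text": highlight["text"],
        "pageIdx": region["pageIdx"],
        "ulx": region["ulx"] + highlight["ulx"],
        "uly": region["uly"] + highlight["uly"],
        "lrx": region["ulx"] + highlight["lrx"],
        "lry": region["uly"] + highlight["lry"],
    }


# Values copied from the example response above.
regions = [{"ulx": 196, "uly": 1703, "lrx": 1232, "lry": 1968, "pageIdx": 0}]
boxes = [
    {"text": "Mason and Jeremiah", "ulx": 675, "uly": 110, "lrx": 1036, "lry": 145,
     "parentRegionIdx": 0},
    {"text": "Dixon,", "ulx": 1, "uly": 167, "lrx": 119, "lry": 204,
     "parentRegionIdx": 0},
]

for box in boxes:
    print(to_page_box(box, regions))
```

Under that assumption the first box ends at x = 1232 on the page, which is also the right edge of the region.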

Does anybody know if a similar plugin exists for Elasticsearch? If not, how can the same result be achieved in Elasticsearch?


I don't think there's any feature like this in Elasticsearch. Wondering if #enterprise-search has a similar feature in the making.
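
Without a plugin, one possible workaround is to do the mapping on the client side. The sketch below is not an Elasticsearch feature. It assumes you index the page text as the OCR words joined by single spaces, keep the word boxes alongside it, and get back a highlighted fragment wrapped in Elasticsearch's default `<em>`/`</em>` tags. The function names, the word list and its coordinates are made up for illustration.

```python
import re


def highlighted_tokens(fragment, pre="<em>", post="</em>"):
    """Split a highlighted fragment into (token, is_highlighted) pairs."""
    chars, flags, inside = [], [], False
    pattern = f"({re.escape(pre)}|{re.escape(post)})"
    for part in re.split(pattern, fragment):
        if part == pre:
            inside = True
        elif part == post:
            inside = False
        else:
            chars.extend(part)
            flags.extend([inside] * len(part))
    text = "".join(chars)
    # A token counts as highlighted if any of its characters was inside the tags.
    return [(m.group(), any(flags[m.start():m.end()]))
            for m in re.finditer(r"\S+", text)]


def boxes_for_fragment(fragment, words):
    """Return the word boxes that correspond to the highlighted tokens.

    Assumption: the fragment's whitespace tokens line up exactly with a
    run of consecutive OCR words; otherwise an empty list is returned.
    """
    tokens = highlighted_tokens(fragment)
    plain = [token for token, _ in tokens]
    page = [word["text"] for word in words]
    for start in range(len(page) - len(plain) + 1):
        if page[start:start + len(plain)] == plain:
            return [words[start + i] for i, (_, hit) in enumerate(tokens) if hit]
    return []


# Made-up word boxes for one page, in reading order.
words = [
    {"text": "Charles", "ulx": 10, "uly": 10, "lrx": 90, "lry": 40},
    {"text": "Mason", "ulx": 100, "uly": 10, "lrx": 170, "lry": 40},
    {"text": "and", "ulx": 180, "uly": 10, "lrx": 220, "lry": 40},
    {"text": "Dixon,", "ulx": 230, "uly": 10, "lrx": 300, "lry": 40},
    {"text": "upon", "ulx": 310, "uly": 10, "lrx": 360, "lry": 40},
]

hits = boxes_for_fragment("Charles <em>Mason and Dixon</em>, upon", words)
print([hit["text"] for hit in hits])
```

This breaks down as soon as the analyzer or the highlighter changes the token boundaries, so treat it as a starting point only.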
