OCR highlighter for Alto or hOCR formats

evelix · May 5, 2022, 3:20pm

We use OCR software to extract the full text from scanned images in Alto or hOCR format. These formats record the coordinates of each word on the scanned image.
When a user searches the full text, we are required to show them the scanned image with the matching text highlighted on it, using this coordinate information.
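
To illustrate what "coordinates of each word" means for hOCR, here is a minimal sketch that pulls word boxes out of an hOCR fragment using only the Python standard library. It assumes the common hOCR convention of `ocrx_word` spans with a `bbox x0 y0 x1 y1` entry in the `title` attribute; the sample markup, its coordinate values, and the names `HocrWordParser` and `parse_hocr_words` are made up for illustration and do not come from our real data.

```python
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collects (text, (x0, y0, x1, y1)) for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word element currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            # The title holds ';'-separated properties, e.g. "bbox 1 2 3 4; x_wconf 96"
            for prop in (attrs.get("title") or "").split(";"):
                parts = prop.split()
                if parts and parts[0] == "bbox":
                    self._bbox = tuple(int(v) for v in parts[1:5])

    def handle_data(self, data):
        text = data.strip()
        if self._bbox is not None and text:
            self.words.append((text, self._bbox))
            self._bbox = None


def parse_hocr_words(hocr):
    """Return a list of (word, bbox) tuples found in an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hypothetical hOCR fragment; the coordinates are illustrative only.
SAMPLE = """<span class='ocr_line' title='bbox 196 1813 1232 1848'>
<span class='ocrx_word' title='bbox 871 1813 980 1848; x_wconf 96'>Mason</span>
<span class='ocrx_word' title='bbox 995 1813 1050 1848; x_wconf 95'>and</span>
</span>"""

words = parse_hocr_words(SAMPLE)
```

Each entry in `words` pairs a token with its pixel box on the page image, which is the information a highlighter would need to carry through to the search result.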

For Solr, there is a plugin that provides a highlighter returning all of this information. An example can be found in the Solr OCR Highlighting Plugin documentation (dbmdz.github.io).

The following example is taken from that page:

{
  "text": "to those parts, subject to unreasonable claims from the proprietor "
          "of Maryland, until the year 17C2, when the whole controversy was "
          "settled by Charles <em>Mason and Jeremiah Dixon</em>, upon their "
          "return from an observation of the transit of Venus, at the Cape of "
          "Good Hope, where they",
  "score": 5555104.5,
  "pages": [
    { "id": "page_380", "width":  1436, "height":  2427 }
  ],
  "regions": [
    { "ulx": 196, "uly": 1703, "lrx": 1232, "lry": 1968, "pageIdx": 0 }
  ],
  "highlights":[
    [{ "text": "Mason and Jeremiah", "ulx": 675, "uly": 110, "lrx": 1036, "lry": 145,
       "parentRegionIdx": 0},
     { "text": "Dixon,", "ulx": 1, "uly": 167, "lrx": 119, "lry": 204,
       "parentRegionIdx": 0 }]
  ]
}
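
For context, this is roughly how a client could turn such a response into boxes to draw on the page image. It is only a sketch: it assumes the highlight coordinates are relative to the upper-left corner of their parent region, which is what the numbers in the sample suggest (for example, "Dixon," starts at `ulx` 1). The function name `to_page_coords` is hypothetical, and the `text` field is omitted from the snippet for brevity.

```python
def to_page_coords(snippet):
    """Translate region-relative highlight boxes into page coordinates.

    Assumption: each highlight box is given relative to the upper-left
    corner of the region referenced by its parentRegionIdx.
    """
    boxes = []
    for span in snippet["highlights"]:      # one list per highlighted match
        for hl in span:                     # a match may wrap across lines
            region = snippet["regions"][hl["parentRegionIdx"]]
            page = snippet["pages"][region["pageIdx"]]
            boxes.append({
                "text": hl["text"],
                "page": page["id"],
                "ulx": region["ulx"] + hl["ulx"],
                "uly": region["uly"] + hl["uly"],
                "lrx": region["ulx"] + hl["lrx"],
                "lry": region["uly"] + hl["lry"],
            })
    return boxes


# The values below are copied from the example response above.
snippet = {
    "pages": [{"id": "page_380", "width": 1436, "height": 2427}],
    "regions": [
        {"ulx": 196, "uly": 1703, "lrx": 1232, "lry": 1968, "pageIdx": 0}
    ],
    "highlights": [
        [
            {"text": "Mason and Jeremiah", "ulx": 675, "uly": 110,
             "lrx": 1036, "lry": 145, "parentRegionIdx": 0},
            {"text": "Dixon,", "ulx": 1, "uly": 167,
             "lrx": 119, "lry": 204, "parentRegionIdx": 0},
        ]
    ],
}

boxes = to_page_coords(snippet)
```

Under that assumption, the first box lands at (871, 1813) to (1232, 1848) on `page_380`. This per-word geometry, returned alongside the text snippet, is what I am looking for an Elasticsearch equivalent of.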

Does anybody know if a similar plugin exists for Elasticsearch? Or if not, how the same is achieved in Elasticsearch?

Thanks
J

dadoonet · May 6, 2022, 4:49am

I don't think there's any feature like this in Elasticsearch. I wonder whether #enterprise-search has a similar feature in the making.
