Is image (.TIF) supported by the Ingest Attachment plugin? (OCR for images)

I'm running ES 7.3.1 with the Ingest Attachment plugin. I can extract text from PDF, txt, docx and xlsx files, but I'm also dealing with .TIF documents. Does the plugin support extracting text from images (OCR)?

I read on a few websites that you only need to install Tesseract on the same server as ES and the plugin will then extract the text from the image. I can't find anything on the Elastic site that confirms OCR support in attachment ingestion, or any extra configuration needed to make it happen. I installed Tesseract v4.0 on Windows Server 2016. Running tesseract directly from the command line against the same file does extract the text.
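For anyone who wants to repeat that command-line check from code, here is a minimal sketch. It assumes the `tesseract` executable is on the PATH and relies on the CLI convention that passing `stdout` as the output base prints the recognized text. The class name `TesseractCheck` and the file name `scan.tif` are placeholders of mine, not part of the plugin or of Tika.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

final class TesseractCheck {

    // Builds: tesseract <image> stdout
    // "stdout" as the output base prints the text instead of writing a .txt file.
    static List<String> command(String tesseractExe, String imagePath) {
        return List.of(tesseractExe, imagePath, "stdout");
    }

    // Runs Tesseract as an external process and returns whatever it printed.
    static String ocr(String tesseractExe, String imagePath)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(tesseractExe, imagePath))
                .redirectError(ProcessBuilder.Redirect.DISCARD) // ignore diagnostics
                .start();
        String text = new String(process.getInputStream().readAllBytes(),
                StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.out.println("usage: java TesseractCheck.java <image, e.g. scan.tif>");
            return;
        }
        System.out.print(ocr("tesseract", args[0]));
    }
}
```

This only shows that Tesseract itself can read the file outside of Elasticsearch; it says nothing about whether the plugin will call it.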

Processing an image as input (base64), here is my output from the ES pipeline:

 "_source" : {
    "attachment" : {
      "content_type" : "image/tiff",
      "content_length" : 0
    }
  } 

(Decoding the base64 back to a TIF gives a valid image.)
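To make the test reproducible, this is roughly how the request can be built. It is a sketch under assumptions: the source field name `data` and the class name `SimulateRequest` are my own choices, and the body is meant to be sent to `POST _ingest/pipeline/_simulate` with an inline pipeline containing only the attachment processor.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

final class SimulateRequest {

    // Base64-encodes the file bytes and wraps them in a _simulate request body.
    // Base64 output only uses A-Z, a-z, 0-9, '+', '/' and '=', so it needs no JSON escaping.
    static String buildBody(byte[] fileBytes) {
        String encoded = Base64.getEncoder().encodeToString(fileBytes);
        return "{\"pipeline\":{\"processors\":[{\"attachment\":{\"field\":\"data\"}}]},"
                + "\"docs\":[{\"_source\":{\"data\":\"" + encoded + "\"}}]}";
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.out.println("usage: java SimulateRequest.java <file>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        // Print the body so it can be pasted into Kibana Dev Tools or sent with curl.
        System.out.println(buildBody(bytes));
    }
}
```

With a TIF the response looks like the one above: the content type is detected, but no `content` field comes back.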

I found the tesseractPath setting in the tesseractOCRconfig.properties file from tika-parser-1.19.1, under the attachment plugin folder in my ES install. I set the path to my Tesseract install, but I'm still not able to process TIF. If I leave it empty, it is supposed to use my Windows environment variable by default. I get the same result in both scenarios.

Has anyone got a TIF file working with Ingest-Attachment?

I've been playing with FSCrawler, but I would like to explore the Attachment plugin further (specifically for TIF) to compare both products.

Any ideas or suggestions are welcome.

No, it's not supported AFAIK, as it would need to run an external process (Tesseract), which I think is not allowed by the security manager.

Thanks for the info, David. From what you describe, it's a shame that image docs are not supported by Ingest Attachment without custom code. I'll wait to see if someone else has run into the same issue; maybe I'll end up building a custom ES plugin.

regards

You can fork the ingest attachment plugin and modify it for your needs (i.e. add more permissions for the security manager: the ones needed to run an external process like Tesseract).
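To illustrate what "the permissions needed to run an external process" means: under a Java security manager, starting a process is checked as a `FilePermission` with the `execute` action on the command. The sketch below only models that check with the standard `Permissions` class. The grants and the Tesseract path are hypothetical and are not copied from the plugin's actual policy, so treat it as an illustration of the mechanism rather than the plugin's configuration.

```java
import java.io.FilePermission;
import java.security.Permissions;

final class ExecPermissionDemo {

    // Hypothetical install location, used for illustration only.
    static final String TESSERACT = "/usr/bin/tesseract";

    // An illustrative set of grants that allows reading files but nothing else.
    static Permissions readOnlyGrants() {
        Permissions granted = new Permissions();
        granted.add(new FilePermission("<<ALL FILES>>", "read"));
        return granted;
    }

    // Launching a process requires FilePermission(<command>, "execute").
    static boolean mayExecute(Permissions granted, String executable) {
        return granted.implies(new FilePermission(executable, "execute"));
    }

    public static void main(String[] args) {
        Permissions granted = readOnlyGrants();
        System.out.println("read-only grants allow exec: " + mayExecute(granted, TESSERACT));

        // What a fork would have to add, in policy terms, for this one executable.
        granted.add(new FilePermission(TESSERACT, "execute"));
        System.out.println("after adding execute:        " + mayExecute(granted, TESSERACT));
    }
}
```

Granting `execute` on a single known path keeps the change narrower than granting it on all files.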

@dadoonet do you know which class handles the security manager?

About my initial question (whether OCR/TIF is supported): I found the parsers supported by the plugin in ingest-attachment/src/main/java/org/elasticsearch/ingest/attachment/TikaImpl.java.
No OCR, image or JPEG parser is in the list.

    /** subset of parsers for types we support */
    private static final Parser PARSERS[] = new Parser[] {
        // documents
        new org.apache.tika.parser.html.HtmlParser(),
        new org.apache.tika.parser.rtf.RTFParser(),
        new org.apache.tika.parser.pdf.PDFParser(),
        new org.apache.tika.parser.txt.TXTParser(),
        new org.apache.tika.parser.microsoft.OfficeParser(),
        new org.apache.tika.parser.microsoft.OldExcelParser(),
        ParserDecorator.withoutTypes(new org.apache.tika.parser.microsoft.ooxml.OOXMLParser(), EXCLUDES),
        new org.apache.tika.parser.odf.OpenDocumentParser(),
        new org.apache.tika.parser.iwork.IWorkPackageParser(),
        new org.apache.tika.parser.xml.DcXMLParser(),
        new org.apache.tika.parser.epub.EpubParser(),
    };
