Are images (.TIF) supported by the Ingest Attachment plugin? (OCR for images)

christiancj · August 4, 2020, 7:51pm

I have ES 7.3.1 with the Ingest Attachment plugin. I can extract text from PDF, txt, docx, and xlsx files, but I'm also dealing with .TIF documents. Does the plugin support extracting text from images (OCR)?

I read on a few websites that you only need to install Tesseract on the same server as ES, and the plugin will then extract the text from images. I can't find anything on the Elastic site that confirms OCR with attachment ingestion, or any extra configuration needed to make it work. I installed Tesseract v4.0 on Windows Server 2016. Running tesseract directly from the command line on the same file does extract the text.

Processing an image as input (base64), here is my output from the ES pipeline:

  "_source" : {
    "attachment" : {
      "content_type" : "image/tiff",
      "content_length" : 0
    }
  }

(Decoding the base64 back to a TIF gives a valid image.)
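
For anyone trying to reproduce this, a request of this kind can be sketched roughly as follows. This is a minimal sketch, not the exact request used here: it assumes the standard `attachment` processor reading from a field named `data`, with the body sent to `POST _ingest/pipeline/_simulate`. The class name `SimulateBody` and the helper `buildSimulateBody` are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

public class SimulateBody {
    // Builds a JSON body for POST _ingest/pipeline/_simulate with a single
    // attachment processor that reads the base64 payload from the "data" field.
    // (Field name "data" is an assumption; use whatever your pipeline expects.)
    static String buildSimulateBody(byte[] fileBytes) {
        // Standard base64 output contains no characters that need JSON escaping.
        String encoded = Base64.getEncoder().encodeToString(fileBytes);
        return "{\n"
            + "  \"pipeline\": {\n"
            + "    \"processors\": [ { \"attachment\": { \"field\": \"data\" } } ]\n"
            + "  },\n"
            + "  \"docs\": [ { \"_source\": { \"data\": \"" + encoded + "\" } } ]\n"
            + "}";
    }

    public static void main(String[] args) throws Exception {
        // Pass the path to a file as the first argument; placeholder bytes are used otherwise.
        byte[] bytes = args.length > 0
            ? Files.readAllBytes(Paths.get(args[0]))
            : "placeholder".getBytes(StandardCharsets.UTF_8);
        System.out.println(buildSimulateBody(bytes));
    }
}
```

The printed body can be pasted into Kibana Dev Tools or sent with any HTTP client. With a TIF as input, the response shows the `attachment` object with `content_type` set and no extracted `content`, matching the output above.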

I found a tesseractPath setting in the tesseractOCRconfig.properties file from tika-parser-1.19.1, under the attachment plugin folder in my ES install. I set it to the path of my Tesseract install, but I still can't process the TIF. If I leave it empty, it is supposed to use my Windows environment variable by default. The result is the same in both scenarios.

Has anyone got a TIF file working with Ingest Attachment?

I've been playing with FSCrawler, but I would like to explore the Attachment plugin more (specifically for TIF) to compare the two.

Any ideas or suggestions are welcome.

dadoonet · August 4, 2020, 8:38pm

No, it's not supported AFAIK, as it would run an external process (Tesseract), which I think is not allowed by the security manager.

christiancj · August 5, 2020, 6:43pm

Thanks for the info, David. It's a shame that, from your experience, image documents are not supported by Ingest Attachment without custom code. I'll wait to see if someone else has run into the same issue; maybe I'll end up building a custom ES plugin.

regards

dadoonet · August 6, 2020, 7:25am

You can fork the ingest attachment plugin and modify it for your needs (i.e. add more permissions for the security manager: the ones needed to run an external process like Tesseract).
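
To make the restriction concrete, here is a rough sketch of what launching Tesseract as an external process looks like in plain Java. It is not how Tika or the plugin does it; the class `TesseractCall` and its methods are hypothetical, and the only thing assumed about Tesseract is the `tesseract <image> stdout` command-line form. Under a security manager, `ProcessBuilder.start()` throws a `SecurityException` unless the running code has been granted `java.io.FilePermission` with the `execute` action for the executable, which is the kind of permission a fork would have to add.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class TesseractCall {
    // Command line equivalent to: tesseract <image> stdout
    static List<String> command(String tesseractPath, String imagePath) {
        return Arrays.asList(tesseractPath, imagePath, "stdout");
    }

    // Starts Tesseract and returns the recognized text.
    // With a security manager installed, start() is rejected unless
    // FilePermission "execute" has been granted for tesseractPath.
    static String ocr(String tesseractPath, String imagePath)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(tesseractPath, imagePath))
            .redirectError(ProcessBuilder.Redirect.INHERIT) // show Tesseract warnings on stderr
            .start();
        byte[] out = process.getInputStream().readAllBytes(); // requires Java 9+
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("tesseract exited with code " + exit);
        }
        return new String(out, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // Usage: java TesseractCall <path-to-tesseract> <path-to-image>
        System.out.println(ocr(args[0], args[1]));
    }
}
```

In a forked plugin, the extra grant would presumably go in the plugin's security policy file, and the call would need to run in a privileged block; treat both as assumptions to verify against the plugin source.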

christiancj · August 6, 2020, 3:33pm

@dadoonet do you know which class handles the security manager?

About my initial question (whether OCR/TIF is supported): I found the parsers supported by the plugin in ingest-attachment/src/main/java/org/elasticsearch/ingest/attachment/TikaImpl.java.
OCR, image, and JPEG parsers are not among them.

    /** subset of parsers for types we support */
    private static final Parser PARSERS[] = new Parser[] {
        // documents
        new org.apache.tika.parser.html.HtmlParser(),
        new org.apache.tika.parser.rtf.RTFParser(),
        new org.apache.tika.parser.pdf.PDFParser(),
        new org.apache.tika.parser.txt.TXTParser(),
        new org.apache.tika.parser.microsoft.OfficeParser(),
        new org.apache.tika.parser.microsoft.OldExcelParser(),
        ParserDecorator.withoutTypes(new org.apache.tika.parser.microsoft.ooxml.OOXMLParser(), EXCLUDES),
        new org.apache.tika.parser.odf.OpenDocumentParser(),
        new org.apache.tika.parser.iwork.IWorkPackageParser(),
        new org.apache.tika.parser.xml.DcXMLParser(),
        new org.apache.tika.parser.epub.EpubParser(),
    };

system · September 3, 2020, 3:34pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
