I have ES 7.3.1 with the Ingest Attachment plugin. I can extract text from PDF, txt, docx and xlsx files, but I'm also dealing with .TIF documents. Does the plugin support extracting text from images (OCR)?
I read on a few websites that you just need to install Tesseract on the same server where ES is installed, and the plugin will then extract the text from the image. I can't find anything on the Elastic site that confirms OCR support in the attachment plugin, or any extra configuration needed to make it work. I installed Tesseract v4.0 on Windows Server 2016. Running tesseract directly from the command line on the same file does extract the text.
Processing an image as input (base64), here is my output from the ES pipeline
(decoding the base64 back to a TIF gives a valid image)
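For anyone who wants to reproduce this, here is a minimal sketch of how the input can be prepared. It checks that the bytes start with a TIFF header and builds a request body for POST _ingest/pipeline/_simulate with an inline attachment processor. The class name, the method names and the field name "data" are my own choices for illustration, not something taken from the plugin.

```java
import java.util.Base64;

class SimulateRequest {

    /** True if the bytes start with a TIFF header: "II*\0" (little-endian) or "MM\0*" (big-endian). */
    static boolean looksLikeTiff(byte[] b) {
        if (b.length < 4) {
            return false;
        }
        boolean littleEndian = b[0] == 'I' && b[1] == 'I' && b[2] == 42 && b[3] == 0;
        boolean bigEndian = b[0] == 'M' && b[1] == 'M' && b[2] == 0 && b[3] == 42;
        return littleEndian || bigEndian;
    }

    /** Builds a JSON body for the simulate API with one attachment processor and one document. */
    static String buildBody(byte[] fileBytes) {
        // The standard Base64 alphabet needs no JSON escaping.
        String data = Base64.getEncoder().encodeToString(fileBytes);
        return "{\"pipeline\":{\"processors\":[{\"attachment\":{\"field\":\"data\"}}]},"
                + "\"docs\":[{\"_source\":{\"data\":\"" + data + "\"}}]}";
    }
}
```

The file bytes would come from something like Files.readAllBytes(Paths.get("scan.tif")), where scan.tif is a placeholder name, and the resulting string is sent as the request body. If looksLikeTiff returns true and the response still has no attachment.content, the problem is on the parsing side and not in the encoding.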
I found the tesseractPath setting in the tesseractOCRconfig.properties file from tika-parser-1.19.1, under the attachment plugin folder in my ES install. I set it to the path of my Tesseract install but I'm still not able to process TIF files. If I leave it empty, it is supposed to fall back to my Windows environment variable by default. The result is the same in both scenarios.
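To rule out the Tesseract install itself, the same lookup can be tried from a plain Java process outside ES. This is only a sketch under my own assumptions: it treats tesseractPath as the folder containing the executable and falls back to the system PATH when it is empty. It is not Tika's actual code, and the class and method names are made up. It needs Java 9+ for readAllBytes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

class TesseractCheck {

    /** Builds "tesseract <image> stdout"; an empty tesseractPath means "resolve via PATH". */
    static List<String> buildCommand(String tesseractPath, String imagePath) {
        String exe = (tesseractPath == null || tesseractPath.trim().isEmpty())
                ? "tesseract"
                : Paths.get(tesseractPath, "tesseract").toString();
        // Passing "stdout" as the output base makes tesseract print the text instead of writing a file.
        return Arrays.asList(exe, imagePath, "stdout");
    }

    /** Runs tesseract and returns the recognized text; throws if the process exits with an error. */
    static String runOcr(String tesseractPath, String imagePath)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(tesseractPath, imagePath))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // keep warnings out of the text
                .start();
        byte[] out = process.getInputStream().readAllBytes();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return new String(out, StandardCharsets.UTF_8);
    }
}
```

Calling runOcr("", "scan.tif") exercises the PATH lookup, and passing the install folder exercises the explicit setting (scan.tif is a placeholder). If both work here but not inside ES, the difference is the ES process, for example its security manager, and not the Tesseract configuration.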
Has anyone got a TIF file working with Ingest-Attachment?
I've been playing with FSCrawler, but I would like to explore the attachment plugin more (specifically for TIF) to compare both products.
Thanks for the info, David. Going by your experience, it's a shame image docs are not supported by Ingest Attachment without custom code. I'll wait to see if someone else has run into the same issue; maybe I'll end up building a custom ES plugin.
You can fork the ingest attachment plugin and modify it for your needs (i.e. add more permissions for the security manager: the ones needed to run an external process like Tesseract).
@dadoonet do you know which class handles the security manager?
About my initial question (whether OCR/TIF is supported): I found the list of parsers supported by the plugin in ingest-attachment/src/main/java/org/elasticsearch/ingest/attachment/TikaImpl.java.
OCR, image and JPEG parsers are not part of it.
    /** subset of parsers for types we support */
    private static final Parser PARSERS[] = new Parser[] {
        // documents
        new org.apache.tika.parser.html.HtmlParser(),
        new org.apache.tika.parser.rtf.RTFParser(),
        new org.apache.tika.parser.pdf.PDFParser(),
        new org.apache.tika.parser.txt.TXTParser(),
        new org.apache.tika.parser.microsoft.OfficeParser(),
        new org.apache.tika.parser.microsoft.OldExcelParser(),
        ParserDecorator.withoutTypes(new org.apache.tika.parser.microsoft.ooxml.OOXMLParser(), EXCLUDES),
        new org.apache.tika.parser.odf.OpenDocumentParser(),
        new org.apache.tika.parser.iwork.IWorkPackageParser(),
        new org.apache.tika.parser.xml.DcXMLParser(),
        new org.apache.tika.parser.epub.EpubParser(),
    };