Mapper-attachment vs Ingest-attachment with OCR

JTaylor · November 15, 2016, 4:34pm

Hello.

I have been looking at using Elastic to read PDF/TIFF files, with OCR to convert them to text, so that we can build search on top of their contents.

Research and a recommendation from a consulting company suggested using the following mapper-attachment plugin to handle at least getting the data into Elastic (though not the OCR part):
https://www.elastic.co/guide/en/elasticsearch/plugins/current/mapper-attachments.html

However, that page declares it is deprecated and points to this plugin instead:
https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html

I cannot find any real difference between the two, nor can I find a GitHub repo for ingest, only for mapper. I don't see anything declaring mapper deprecated except on that documentation page.

So, I have several questions arising from this:

1. Is mapper-attachment truly deprecated, and should we be looking to use ingest instead (even though I see no difference)?
2. Mapper doesn't claim to support OCR, but says that since the underlying library, Tika, does (with the required dependencies), it might work out of the box. Has anyone tried using either of those plugins with OCR and had it work correctly?
3. Is there some other plugin for any part of the Elastic stack which could do this?

This is research, so for now I'm manually running the OCR on the TIFFs and then placing those in Elastic, with some success, so I can tell it's at least possible, but it feels like I'm missing information.

Thanks for the time and thoughts.

dadoonet · November 15, 2016, 4:47pm

Yes. The main difference is that ingest modifies the _source document before indexing and can do that on specific ingest nodes. Mapper does that on the data node, and you don't see the effect in _source, only in what has been indexed in Lucene.
I think some people succeeded at some point. I'm not sure what the current status is with the Security Manager.
I know that some people are using OCR (by adding Tesseract) when using FSCrawler.
Not as far as I know.
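
To make the first point concrete, here is a minimal sketch of the two request bodies involved when using the ingest-attachment plugin: a pipeline with an `attachment` processor, and a document carrying the file as base64. The pipeline id `attachment`, the field name `data`, and the helper `build_index_body` are illustrative assumptions, not something taken from this thread, and the sketch only builds the JSON rather than sending it.

```python
import base64
import json

# Pipeline definition using the ingest-attachment plugin's "attachment" processor.
# The pipeline id and the source field name "data" are illustrative assumptions.
PIPELINE_ID = "attachment"
PIPELINE_BODY = {
    "description": "Extract text from a base64-encoded binary field",
    "processors": [{"attachment": {"field": "data"}}],
}


def build_index_body(raw_bytes: bytes) -> dict:
    """Wrap raw file bytes as the base64 string the attachment processor reads."""
    return {"data": base64.b64encode(raw_bytes).decode("ascii")}


if __name__ == "__main__":
    # Body for: PUT _ingest/pipeline/attachment
    print(json.dumps(PIPELINE_BODY, indent=2))
    # Body for indexing a document with ?pipeline=attachment
    print(json.dumps(build_index_body(b"placeholder file bytes")))
```

With the processor's default settings, the extracted text is written under `attachment.content` in the stored _source, which is the visible difference from mapper-attachment described above.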

JTaylor · November 15, 2016, 5:15pm

Thanks for the reply!

I had seen the Tesseract comments made on an issue on the mapper repo, so right now I'm manually using that to run the OCR to get the PDF to submit to mapper. I just wasn't sure whether there was a correct approach, or whether anyone had experience getting those plugins to handle it correctly without such scripts. For now, this works, but your FSCrawler looks interesting in that regard, as the end result would likely want something similar.
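
For anyone wanting to try the same kind of manual pre-processing, here is a rough sketch of the idea, assuming the `tesseract` command-line tool is installed and on the PATH. It is a variant of what I described: it pulls plain text out of the image rather than producing a PDF. The function names and the `path`/`content` field names are hypothetical, and sending the resulting document to Elastic is left out.

```python
import subprocess
from typing import Callable, List


def build_tesseract_command(image_path: str, lang: str = "eng") -> List[str]:
    """Command line asking Tesseract to print the recognized text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def run_command(cmd: List[str]) -> str:
    """Run a command and return its stdout; raises if the exit code is non-zero."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def ocr_to_document(
    image_path: str, runner: Callable[[List[str]], str] = run_command
) -> dict:
    """OCR one image and shape the result as a JSON-ready document.

    The field names "path" and "content" are hypothetical choices.
    """
    text = runner(build_tesseract_command(image_path))
    return {"path": image_path, "content": text.strip()}
```

Calling `ocr_to_document("scan.tif")` would return a dict that can be serialized and indexed as an ordinary document; the `runner` parameter exists only so the OCR step can be swapped out or faked when testing.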

(This is all in a testing phase to even determine if it's going to be worth moving the process we have now to Elastic).

I had hoped there was a more Elastic-built way, since the libraries at least have the access, but so far we're being forced into doing this research ourselves, so I want to make sure I'm familiar with all the requirements before trying.

I just adjusted the title to more correctly represent the usage, since it's not just comparison, but the usage of OCR as well.

system · December 13, 2016, 5:16pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
