Hello.
I have been looking at using Elastic to ingest PDF/TIFF files, with OCR extracting their text, so that we can build search on top of the contents.
A consulting company's research and recommendation suggested the following mapper-attachments plugin to handle at least getting the data into Elastic (though not the OCR part):
https://www.elastic.co/guide/en/elasticsearch/plugins/current/mapper-attachments.html
However, that page declares it is deprecated and points to this plugin instead:
https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html
I cannot find any real difference between the two, nor can I find a GitHub repo for ingest, only for mapper. I don't see anything declaring mapper deprecated except on that documentation page.
So, I have several questions arising from this:
- Is mapper-attachments truly deprecated, and should we be looking to use ingest instead (even though I see no difference)?
- Mapper doesn't claim to support OCR, but it says that since the underlying library, Tika, does (with the required dependencies), it might work out of the box. Has anyone tried using either of those plugins with OCR and had it work correctly?
- Is there some other plugin for any part of the Elastic stack which could do this?
This is research, so for now I'm manually running the OCR on the TIFFs and then placing the resulting text in Elastic, with some success. So I can tell it's at least possible, but it feels like I'm missing information.
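For context, the manual workflow can be sketched roughly as below. This is only a minimal illustration, not my exact setup: it assumes the Tesseract command-line tool is installed and on the PATH, that Elasticsearch is reachable at http://localhost:9200, and that the version in use accepts the `/_doc/` endpoint. The index name `scanned-docs`, the field names `filename` and `content`, and the helper names are all hypothetical.

```python
import json
import subprocess
import urllib.parse
import urllib.request
from pathlib import Path

ES_URL = "http://localhost:9200"  # assumption: local, unsecured test node
INDEX = "scanned-docs"            # hypothetical index name


def ocr_tiff(path: Path) -> str:
    """Run the Tesseract CLI on one image and return the recognized text."""
    # "stdout" as the output base makes tesseract print the text
    # instead of writing it to a file.
    result = subprocess.run(
        ["tesseract", str(path), "stdout"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def build_document(path: Path, text: str) -> dict:
    """Build the JSON body to index; the field names are illustrative only."""
    return {"filename": path.name, "content": text.strip()}


def index_document(doc_id: str, doc: dict) -> int:
    """PUT one document into Elasticsearch and return the HTTP status code."""
    url = f"{ES_URL}/{INDEX}/_doc/{urllib.parse.quote(doc_id, safe='')}"
    request = urllib.request.Request(
        url,
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def index_folder(folder: Path) -> None:
    """OCR every .tif file in a folder and index the extracted text."""
    for tiff in sorted(folder.glob("*.tif")):
        index_document(tiff.stem, build_document(tiff, ocr_tiff(tiff)))
```

Calling `index_folder(Path("scans"))` would process a hypothetical `scans` directory. Nothing here involves either attachment plugin: the OCR happens outside Elastic and only plain text is indexed, which is the part I'm hoping a plugin could take over.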
Thanks for the time and thoughts.