Mapper-attachment vs Ingest-attachment with OCR

(Joshua Taylor) #1


I have been looking at using Elastic to read PDF/TIFF files, with OCR to parse them into text, so that we can build search on top of the contents.

Research and a recommendation from a consulting company suggested using the following mapper-attachment plugin, at least to handle getting the data into Elastic (not the OCR part):

However, that page declares it is deprecated and points to this plugin instead:

I cannot find any real difference between the two, nor can I find a GitHub repo for ingest, only for mapper. I don't see anything declaring mapper deprecated except on that documentation page.

So, I have several questions arising from this:

  1. Is mapper-attachment truly deprecated, and should we be looking to use ingest instead (even though I see no difference)?

  2. Mapper doesn't claim to support OCR, but says that since the underlying library, Tika, does (with the required dependencies), it might work out of the box. Has anyone tried using either of those plugins with OCR and had it work correctly?

  3. Is there some other plugin for any part of the Elastic stack which could do this?

This is research, so for now I'm manually running the OCR on the TIFFs and then placing the results in Elastic, with some success. So I can tell it's at least possible, but it feels like I'm missing information.

Thanks for the time and thoughts.

(David Pilato) #2
  1. Yes. The main difference is that ingest modifies the _source document before indexing and can do that on specific ingest nodes. Mapper does that on the data node, and you don't see the effect in _source, only in what has been indexed in Lucene.

  2. I think some people succeeded at some point. I'm not sure what the current status is now with the Security Manager.
    I know that some people are using OCR (by adding Tesseract) with FSCrawler.

  3. Not as far as I know.
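
To illustrate point 1, here is a minimal sketch of what the ingest route looks like from a client. It only builds the two request bodies (the pipeline definition and a document carrying the file as base64) and does not send anything. The pipeline name `attachment`, the index name `documents`, and the source field `data` are hypothetical choices for this example, and the `/_doc/` path form is an assumption that depends on your Elasticsearch version (older versions expect a type name there instead).

```python
import base64
import json

PIPELINE_ID = "attachment"  # hypothetical pipeline name
INDEX_NAME = "documents"    # hypothetical index name


def pipeline_request():
    """Return (method, path, body) defining an ingest pipeline."""
    body = {
        "description": "Extract text from a base64-encoded file",
        "processors": [
            # The attachment processor reads base64 content from `field`.
            # By default the extracted text and metadata are written under
            # an `attachment` object, so they are visible in _source.
            {"attachment": {"field": "data"}}
        ],
    }
    return "PUT", f"/_ingest/pipeline/{PIPELINE_ID}", body


def index_request(doc_id, raw_bytes):
    """Return (method, path, body) indexing one file through the pipeline."""
    body = {"data": base64.b64encode(raw_bytes).decode("ascii")}
    # Path form is an assumption; adjust it to your Elasticsearch version.
    path = f"/{INDEX_NAME}/_doc/{doc_id}?pipeline={PIPELINE_ID}"
    return "PUT", path, body


if __name__ == "__main__":
    requests = (pipeline_request(), index_request("1", b"%PDF-1.4 sample"))
    for method, path, body in requests:
        print(method, path)
        print(json.dumps(body, indent=2))
```

Because the processor runs before indexing, the extracted text ends up in the stored document itself, which is the practical difference from mapper described in point 1.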

(Joshua Taylor) #3

Thanks for the reply!

I had seen the Tesseract comments on an issue in the mapper repo, so right now I'm manually using that to run the OCR and get the PDF that I submit to mapper. I just was not sure if there was a correct approach, or if anyone had experience getting those plugins to handle it without such scripts. For now this works, but your FSCrawler looks interesting in that regard, as the end result would likely need something similar.
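
For reference, a rough sketch of the kind of manual script this means, not the exact one in use. It assumes the `tesseract` binary is installed and on the PATH, and it takes plain text from Tesseract rather than a PDF to keep the example short. The document field names `filename` and `content` are hypothetical.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path):
    """Build the CLI call; using `stdout` as the output base prints the text."""
    return ["tesseract", str(image_path), "stdout"]


def run_ocr(image_path):
    """Run Tesseract on one image and return the recognised text."""
    result = subprocess.run(
        tesseract_command(image_path),
        capture_output=True,
        text=True,
        check=True,  # raise if Tesseract exits with an error
    )
    return result.stdout


def to_document(image_path, text):
    """Shape OCR output as a JSON-ready document (hypothetical field names)."""
    return {"filename": Path(image_path).name, "content": text.strip()}
```

Something like `to_document(path, run_ocr(path))` then gives a body that can be indexed as a normal document, with no attachment plugin involved at all.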

(This is all in a testing phase to even determine if it's going to be worth moving the process we have now to Elastic).

I had hoped there was a more Elastic-built way, since the libraries at least have the access, but so far we're being forced into doing this research ourselves, so I want to make sure I'm familiar with all the requirements before trying.

I just adjusted the title to better reflect the question, since it's not just a comparison but also about the use of OCR.
