Configure attachment mapper to use OCR plugin


(Waheed Abualrous) #1

I installed Elasticsearch with the mapper attachments plugin, then installed Tesseract OCR on the same machine. My goal is to index images through Elasticsearch.

Currently I can parse and index Microsoft Office files with Elasticsearch, but not images. Somehow Elasticsearch needs to know that Tesseract is installed on the machine and pass the image to it so the text can be extracted.
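For context, the mapper attachments plugin expects the file as a base64-encoded string in a field mapped with type `attachment`. Below is a minimal Python sketch of what the mapping and the document body for an image could look like. The type name `doc` and the field name `file` are made-up names for illustration, and the snippet only builds the JSON; it does not send anything to Elasticsearch.

```python
import base64
import json

# Mapping that declares a hypothetical "file" field as an attachment.
# "doc" and "file" are illustrative names, not taken from this thread.
MAPPING = {
    "mappings": {
        "doc": {
            "properties": {
                "file": {"type": "attachment"}
            }
        }
    }
}


def build_attachment_doc(data: bytes, field: str = "file") -> str:
    """Return a JSON document body with the raw bytes base64-encoded."""
    encoded = base64.b64encode(data).decode("ascii")
    return json.dumps({field: encoded})


if __name__ == "__main__":
    # Placeholder bytes; in practice these would be read from the image file.
    print(json.dumps(MAPPING))
    print(build_attachment_doc(b"placeholder image bytes"))
```

Whether the image text actually gets extracted then depends on Tika being able to reach Tesseract, which is the open question here.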

The Tesseract installation itself is fine, as I can use it standalone. Any help making it work with Elasticsearch?

Thank you.


(David Pilato) #2

Well, I have never tried it, although I guess this should work.
I'd be interested in your findings, and in a PR to document it :stuck_out_tongue:

Note that we may greatly reduce the number of dependencies in the future, and you may have to add some libraries yourself to perform OCR. See https://github.com/elastic/elasticsearch-mapper-attachments/issues/163


#3

@Waheed_Abualrous
Did you ever manage to get Tesseract working in conjunction with ES?
If so, could you share your solution?
Thanks.


(Waheed Abualrous) #4

I haven't found the solution myself, but the team member who fixed it said that the OCR plugin and the attachment mapper work with each other out of the box. In our case it was a permission issue: the user running Elasticsearch did not have permission to access the OCR libraries. We moved everything under the same folder, gave that user permission to the folder, and it worked.
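A quick way to look for this kind of permission problem is to walk the OCR installation folder as the user that runs Elasticsearch and list anything that user cannot read. This is only a rough Python sketch: the function name is made up, the folder to check is whatever your own installation uses, and it says nothing about how Elasticsearch or Tika resolve the libraries internally.

```python
import os
from typing import List


def inaccessible_paths(root: str) -> List[str]:
    """List paths under `root` that the current user cannot read
    (or, for directories, cannot read and enter)."""
    if not os.path.isdir(root):
        # Missing or not a directory: report the root itself.
        return [root]
    problems = []
    if not os.access(root, os.R_OK | os.X_OK):
        problems.append(root)
    for dirpath, dirnames, filenames in os.walk(root):
        # Check subdirectories from the parent, because os.walk silently
        # skips directories it is not allowed to list.
        for name in dirnames:
            full = os.path.join(dirpath, name)
            if not os.access(full, os.R_OK | os.X_OK):
                problems.append(full)
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.access(full, os.R_OK):
                problems.append(full)
    return problems
```

Run it as the Elasticsearch user against the folder holding the Tesseract binaries and data. An empty list means that user can at least read everything in there.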

Hope this helps someone.


(David Pilato) #5

Interesting. BTW, I'm wondering whether it still works with 2.1.1.

Also, do you want to document it and send a PR?


(Waheed Abualrous) #6

I will check with my teammate, and if he is confident about the solution I will let him do that.


(Subhadip Ghoshal) #7

Hi, this is Subhadip, Waheed's teammate.

I was able to find a workaround for ES 1.7.2 with mapper-attachments 2.7.1, as recommended.
Our original installation was in the /usr/local/share directory, which requires superuser access. Although mapper-attachments, and hence Tika, worked absolutely fine there, Tesseract didn't. I tried to trace the problem through the logs but couldn't find anything useful.

So just for kicks, I tried the installation in the local user's directory, and voilà! It worked out of the box without any extra effort on my part. I just had to make sure the paths were configured properly.

So I assumed at that point that the issue had something to do with superuser privileges. Yet even when I ran Elasticsearch with su privileges, it still couldn't interact with tesseract-ocr.
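One way to narrow this down is to check, as the same user that runs Elasticsearch, whether the `tesseract` executable can be found and launched at all. The Python sketch below does only that. The helper name `try_launch` is made up for illustration, and it does not reproduce how Tika calls Tesseract; it only tells you whether this user can start the process from its own PATH.

```python
import subprocess
from typing import List, Tuple


def try_launch(command: List[str], timeout: float = 10.0) -> Tuple[bool, str]:
    """Try to run `command`; return (ok, detail) instead of raising."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except FileNotFoundError:
        return False, "not found on PATH"
    except PermissionError:
        return False, "found, but this user may not execute it"
    except subprocess.TimeoutExpired:
        return False, "timed out"
    if result.returncode != 0:
        return False, f"exited with code {result.returncode}"
    # Some tesseract versions print the version banner to stderr,
    # so fall back to stderr when stdout is empty.
    return True, (result.stdout or result.stderr).strip()


if __name__ == "__main__":
    # Run this as the user that runs Elasticsearch.
    print(try_launch(["tesseract", "--version"]))
```

If this succeeds for the Elasticsearch user but OCR still does not happen, the problem is more likely inside the Elasticsearch process than in file permissions.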

However, that's the story for ES 1.7. For ES 2.0 with mapper-attachments 3.0.2, I couldn't figure out how to make it work at all. Tika itself worked fine, as was evident from the fact that it could index text, Word, PDF files etc., but OCR didn't. We gave up at that point for lack of time and just ran with ES 1.7.2.

I can't say I have dug deeper than that, so any help would be appreciated, as this is a problem we still have not been able to solve.

As for a PR (I assume you meant a pull request?), I am not sure that would be helpful, since I have only skimmed the source code and haven't tried any customizations of my own.

Hope that helps! And please keep me posted.

Thanks


(David Pilato) #8

Maybe it's because we added more security in 2.0, and the security manager doesn't allow the Tesseract lib to talk to another process (I don't know how Tesseract works, though).

That being said, in the Elasticsearch master branch we have greatly reduced the number of dependencies included with the mapper attachments plugin, in order to:

  • limit the final artifact size
  • avoid all Jar Hell issues and dependency conflicts
  • only provide what we can really test

So I think that even if you can make it work in the 2.x series, it won't be possible in the next series unless you manually add Tesseract and all its deps to the plugin dir.

Note that OCR is a heavy task, which should not impact the indexing process. That's one of the reasons using OCR has never been advised.
Instead, we will probably port the mapper attachments plugin to an Ingest Node plugin, and then we could hopefully reintroduce OCR support.

I opened #16303 BTW for this.

