Configure attachment mapper to use OCR plugin

Hi, this is Subhadip, Waheed's teammate.

I was able to find a workaround for ES 1.7.2 with mapper-attachments 2.7.1, as recommended.
Our original installation was in the /usr/local/share directory, which requires superuser access. Although mapper-attachments, and hence Tika, worked absolutely fine there, Tesseract didn't. I went through the logs but couldn't find anything useful.

So, just for kicks, I tried installing into my local user directory, and voilà! It worked out of the box, with no extra effort on my part. I just had to make sure the paths were configured properly.

At that point, I assumed the issue had something to do with superuser privileges. Even when I ran Elasticsearch with su privileges, it still couldn't interact with tesseract-ocr.
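
One way to test the permissions theory is to check whether the tesseract binary can be found and executed by the account that runs Elasticsearch. Below is a minimal Python sketch of that check. It assumes, as far as I understand it, that Tika launches tesseract as an external command, so the binary has to be resolvable and executable for that user. The helper names `find_command` and `describe` are made up for illustration and are not part of Elasticsearch, Tika, or Tesseract.

```python
import os
import shutil


def find_command(name, search_path=None):
    """Return the full path of `name` if the current user can execute it, else None.

    search_path is an os.pathsep-separated list of directories;
    None means "use the PATH of the current process".
    """
    return shutil.which(name, mode=os.F_OK | os.X_OK, path=search_path)


def describe(name, search_path=None):
    """Build a one-line, human-readable report for the lookup."""
    resolved = find_command(name, search_path)
    user = os.environ.get("USER", "unknown")
    if resolved is None:
        return f"{name}: not found or not executable for user {user}"
    return f"{name}: {resolved} (executable for user {user})"


if __name__ == "__main__":
    # Run this as the same user that starts Elasticsearch.
    print(describe("tesseract"))
```

Running it once from the local user account and once from the account used for the /usr/local/share installation should show whether the two environments see the binary differently. It only checks the lookup, so a passing result does not prove that OCR itself works.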

However, that's the story for ES 1.7. For ES 2.0 with mapper-attachments 3.0.2, I couldn't figure out how to make it work at all. Tika itself worked fine, as was evident from the fact that it could index text, Word, and PDF files, but OCR didn't work. We gave up at that point for lack of time and just went with ES 1.7.2.

I can't say I have dug any deeper than that, so any pointers would be appreciated, as this is a problem we still haven't been able to solve.

As for a PR (I assume you meant a pull request?), I am not sure that would be helpful, since I have only skimmed the source code and haven't really tried any customizations of my own.

Hope that helps! And please keep me posted.

Thanks
