Read image text from PDF


(Nikhil Chandrakant Parab) #1

Hello team,
I am using the FSCrawler 2.3 snapshot. With it I can load a PDF into Elasticsearch, but I cannot read the image text from it. Is there any way to read the image text?
If anyone knows how, please reply.


(David Pilato) #2

Have a look at OCR: https://github.com/dadoonet/fscrawler/blob/master/README.md#tips-and-tricks

That said there is a bug report: https://github.com/dadoonet/fscrawler/issues/314


(Nikhil Chandrakant Parab) #3

OK, thanks @dadoonet.
I will try it.


(David Pilato) #4

I'll be happy to have your feedback


(Nikhil Chandrakant Parab) #5

Sure @dadoonet.
I am reading the content of the link you sent, but I can't find the exact point I need and I don't fully understand it yet. I need to read it again.
If I find anything helpful I will let you know.
Thanks


(Nikhil Chandrakant Parab) #6

Sorry @dadoonet,
I can't find what I am looking for.
Actually, I don't want to write any code.
But thanks once again for your quick reply.
Is there any other tool like FSCrawler that can also parse images?
If so, can you please tell me?
Frankly speaking, I am quite weak at programming.


(David Pilato) #7

FSCrawler does not require any programming.


(Nikhil Chandrakant Parab) #8

OK. But how can I use Tesseract OCR with FSCrawler?
Can you please tell me?


(Nikhil Chandrakant Parab) #9

Also, Tesseract OCR is not working on 64-bit.
Is there any other option?


(David Pilato) #10

I don't believe Tika has any other option.


(Nikhil Chandrakant Parab) #11

OK.
Thanks @dadoonet for the quick reply.
If you find anything about parsing images, please let me know.
Until then I will try Tesseract OCR.
Thanks once again.


(Nikhil Chandrakant Parab) #12

Hello @dadoonet

I installed tesseract-alpha for Windows, following these instructions:
"To deal with images containing text, just install Tesseract. Tesseract will be auto-detected by Tika. Then add an image (png, jpg, ...) into your Fscrawler root directory. After the next index update, the text will be indexed and placed in "_source.content"."

I added images to the FSCrawler root directory and ran FSCrawler from cmd.
It gives the following output:

      {
        "_index": "photo",
        "_type": "image",
        "_id": "54d256ed121e93f6946f8e177634ff0",
        "_score": 1,
        "_source": {
          "meta": {},
          "file": {
            "extension": "jpg",
            "content_type": "image/jpeg",
            "last_modified": "2016-12-20T10:37:22.125",
            "indexing_date": "2017-04-15T15:34:16.65",
            "filesize": 12952,
            "filename": "bigdata.jpg",
            "url": """file://C:\tmp\image\bigdata.jpg"""
          },
          "path": {
            "encoded": "45f07b74406231761c074a1189bc9aa",
            "root": "45f07b74406231761c074a1189bc9aa",
            "virtual": "/",
            "real": """C:\tmp\image\bigdata.jpg"""
          }
        }
      }
    ]
  }
}

The text in the image is not included in the output.
What can I do about this?
Can you please tell me?
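
A quick way to confirm that no text was extracted is to check the hit for the "_source.content" field that the quoted instructions mention. The sketch below is only an illustration: the helper name extracted_text is hypothetical, and the dictionary is a trimmed copy of the hit posted above rather than a live Elasticsearch response.

```python
def extracted_text(hit):
    """Return the extracted text of a search hit, or None if it is absent."""
    content = hit.get("_source", {}).get("content")
    if isinstance(content, str) and content.strip():
        return content
    return None


# Trimmed copy of the hit posted above: there is no "content" key at all.
hit = {
    "_index": "photo",
    "_type": "image",
    "_source": {
        "meta": {},
        "file": {"filename": "bigdata.jpg", "content_type": "image/jpeg"},
    },
}

print(extracted_text(hit))  # prints None
```

If OCR had run, the same call would return the recognised text instead of None.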

I am referring to the following link


(David Pilato) #13

Thanks for testing.
It is probably the same issue I previously linked to. You can add your comment on the issue.

I'll try to fix it in the coming months, if it is fixable.


(Nikhil Chandrakant Parab) #14

Thanks @dadoonet for the quick reply.
Where can I add my comment?
Can you please provide the link?


(David Pilato) #15


(Nikhil Chandrakant Parab) #16

Thanks @dadoonet.


(Nikhil Chandrakant Parab) #17

I tried tesseract-ocr from cmd and it is able to read the text in the image.
Now, how can FSCrawler detect tesseract-ocr so that it reads the image text?
Is there any configuration required?

Please reply.
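
Since the quoted instructions say Tesseract is auto-detected, one thing worth checking is whether the tesseract executable can be found from the environment FSCrawler is started in. The sketch below assumes that auto-detection amounts to launching tesseract from the process PATH, which the thread does not confirm. The helper names find_tesseract and tesseract_version are hypothetical.

```python
import shutil
import subprocess


def find_tesseract(search_path=None):
    """Return the full path of the tesseract executable, or None.

    search_path is a directory list in PATH syntax; None means the PATH
    of the current process is used.
    """
    return shutil.which("tesseract", path=search_path)


def tesseract_version(executable):
    """Return the first line printed by `tesseract --version`."""
    result = subprocess.run(
        [executable, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some releases print the version on stderr
        universal_newlines=True,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    exe = find_tesseract()
    if exe is None:
        print("tesseract is not on PATH")
    else:
        print(exe, "->", tesseract_version(exe))
```

Run it from the same command prompt or shell used to start FSCrawler, so that both see the same PATH.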


(Nikhil Chandrakant Parab) #18

Hello @dadoonet
I switched my OS to CentOS 7 to read image text.
I installed "tesseract-3.04.00-3.el7.x86_64.rpm" on CentOS and ran FSCrawler.
I added only images that contain text, and now I can successfully read the text of plain images.
But I am not able to read the text of images embedded in a PDF after indexing the PDF in Elasticsearch.

Is there any solution for that?
Please reply.
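
To narrow down whether the problem lies in the PDF images themselves or in the indexing pipeline, the embedded images can be pulled out of the PDF and passed to Tesseract by hand. The sketch below is a diagnostic outside FSCrawler, not a fix. It assumes the poppler-utils pdfimages tool is installed (it is not mentioned in this thread) and that the installed tesseract accepts stdout as its output base. The function names are hypothetical.

```python
import glob
import os
import subprocess
import tempfile


def pdfimages_command(pdf_path, output_prefix):
    """Command that writes each embedded image as <output_prefix>-NNN.png."""
    return ["pdfimages", "-png", pdf_path, output_prefix]


def tesseract_command(image_path):
    """Command that prints the recognised text to standard output."""
    return ["tesseract", image_path, "stdout"]


def ocr_pdf_images(pdf_path):
    """Return the OCR text of every image embedded in pdf_path."""
    texts = []
    with tempfile.TemporaryDirectory() as workdir:
        prefix = os.path.join(workdir, "img")
        subprocess.run(pdfimages_command(pdf_path, prefix), check=True)
        for image in sorted(glob.glob(prefix + "-*.png")):
            result = subprocess.run(
                tesseract_command(image),
                stdout=subprocess.PIPE,
                universal_newlines=True,
                check=True,
            )
            texts.append(result.stdout)
    return texts
```

If ocr_pdf_images returns readable text for the PDF while the indexed document has none, the images are fine and the gap is in how the PDF is parsed during indexing.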


(David Pilato) #19

Maybe share your PDF here, in the FSCrawler issue, or even better in the Tika project?


(Nikhil Chandrakant Parab) #20

I am trying to upload the PDF here, but it gives me the following error:

Sorry, the file you are trying to upload is not authorized (authorized extensions: jpg, jpeg, png, gif).