Read image text from PDF

Hello team,
I'm using the FSCrawler 2.3 snapshot. With it I can load PDFs into Elasticsearch, but I can't read the text of the images inside them. Is there any way to read image text?
If anyone knows, please reply.

Have a look at OCR: https://github.com/dadoonet/fscrawler/blob/master/README.md#tips-and-tricks

That said there is a bug report: https://github.com/dadoonet/fscrawler/issues/314

OK, thanks @dadoonet.
I will try it.

I'll be happy to have your feedback

Sure, @dadoonet.
I'm reading the content of the link you sent, but I can't find the exact point I'm looking for and I don't understand it yet. I need to read it again.
If I find anything helpful, I will let you know.
Thanks

Sorry, @dadoonet,
I can't find what I'm looking for.
Actually, I don't want to write any code.
But thanks once again for your quick reply.
Is there any other tool like FSCrawler that can parse images as well?
If so, can you please tell me?
Frankly speaking, I'm quite weak at programming.

FSCrawler does not require any programming.

OK. But how can I use Tesseract OCR with FSCrawler?
Can you please tell me?

Also, Tesseract OCR is not working on 64-bit.
Is there any other option?

I don't believe Tika has any other option.

OK.
Thanks @dadoonet for the quick reply.
If you find anything about parsing images, please let me know.
Until then I will try Tesseract OCR.
Thanks once again.

Hello @dadoonet

I installed tesseract-alpha for Windows, following these instructions:
"To deal with images containing text, just install Tesseract. Tesseract will be auto-detected by Tika. Then add an image (png, jpg, ...) into your Fscrawler root directory. After the next index update, the text will be indexed and placed in "_source.content". "

I added images to the FSCrawler root directory and ran FSCrawler from cmd.
It gives the following output:

{
  "_index": "photo",
  "_type": "image",
  "_id": "54d256ed121e93f6946f8e177634ff0",
  "_score": 1,
  "_source": {
    "meta": {},
    "file": {
      "extension": "jpg",
      "content_type": "image/jpeg",
      "last_modified": "2016-12-20T10:37:22.125",
      "indexing_date": "2017-04-15T15:34:16.65",
      "filesize": 12952,
      "filename": "bigdata.jpg",
      "url": """file://C:\tmp\image\bigdata.jpg"""
    },
    "path": {
      "encoded": "45f07b74406231761c074a1189bc9aa",
      "root": "45f07b74406231761c074a1189bc9aa",
      "virtual": "/",
      "real": """C:\tmp\image\bigdata.jpg"""
    }
  }
}
]
}
}

The text in the image is not returned.
What can I do about this?
Can you please tell me? :slight_smile:
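The instructions quoted above say the OCR text should end up in "_source.content", and the hit shown here has no such field. A minimal Python sketch of that check, assuming a search hit has already been loaded as a dictionary (the helper name `extracted_text` and the trimmed-down sample hit are illustrative, not part of FSCrawler):

```python
def extracted_text(hit):
    """Return the text stored in _source.content, or None if it is missing or empty."""
    source = hit.get("_source", {})
    content = source.get("content")
    if isinstance(content, str) and content.strip():
        return content
    return None


# Trimmed-down copy of the hit from the post: there is no "content" key at all.
hit = {
    "_index": "photo",
    "_type": "image",
    "_source": {
        "meta": {},
        "file": {"filename": "bigdata.jpg", "content_type": "image/jpeg"},
    },
}

print(extracted_text(hit))  # None
```

A result of None for an image that clearly contains text suggests OCR never ran for that file, rather than that OCR ran and found nothing.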

I am referring to the following link

Thanks for testing.
Probably the same issue I previously linked to. You can add your comment on the issue.

I'll try to fix it in the coming months, if it is fixable.

Thanks @dadoonet for the quick reply.
Which link can I add my comment to?
Can you please provide the link? :slight_smile:

Thanks @dadoonet.:slight_smile:

I tried tesseract-ocr from cmd and it is able to read the text from the image.
Now, how can FSCrawler detect tesseract-ocr to read image text?
Is there any configuration required?
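One thing that can be checked without touching FSCrawler is whether the `tesseract` executable is visible on the PATH. The quoted instructions say Tika auto-detects Tesseract; this sketch assumes that detection depends on the executable being reachable from the environment that launches FSCrawler, which is an assumption and not something confirmed in this thread. The helper names are illustrative.

```python
import shutil
import subprocess


def find_tesseract(executable="tesseract"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(executable)


def tesseract_version(executable="tesseract"):
    """Return the first line printed by `tesseract --version`, or None if unavailable."""
    path = find_tesseract(executable)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Depending on the build, the version banner goes to stdout or stderr.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    print("tesseract path:", find_tesseract())
    print("tesseract version:", tesseract_version())
```

Run it from the same cmd window or shell that is used to start FSCrawler. If the path prints as None there, the install directory is probably missing from PATH for that session.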

Please reply.

Hello @dadoonet,
I switched my OS to CentOS 7 to read image text.
I installed "tesseract-3.04.00-3.el7.x86_64.rpm" on CentOS and ran FSCrawler.
I added only images that contain text, and now I can successfully read the text of plain images.
But I am not able to read the text of images embedded in a PDF after indexing the PDF in Elasticsearch.

Is there any solution for that?
Please reply.
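To narrow down whether the problem is in Tesseract or in how the PDF is handled, the embedded images can be pulled out and OCR'd by hand. This is a manual diagnostic outside FSCrawler, not an FSCrawler feature. It is a sketch that assumes poppler's `pdfimages` tool and `tesseract` are both installed and on the PATH; the function names and the `img` output prefix are illustrative.

```python
import shutil
import subprocess
from pathlib import Path


def extract_command(pdf_path, out_dir):
    """Build the pdfimages command that dumps embedded images as PNG files."""
    prefix = str(Path(out_dir) / "img")
    return ["pdfimages", "-png", str(pdf_path), prefix]


def ocr_command(image_path):
    """Build the tesseract command that prints recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout"]


def ocr_embedded_images(pdf_path, out_dir):
    """Extract every embedded image from the PDF and return {image name: OCR text}."""
    for tool in ("pdfimages", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(extract_command(pdf_path, out_dir), check=True)
    texts = {}
    # pdfimages names its output files <prefix>-000.png, <prefix>-001.png, ...
    for image in sorted(Path(out_dir).glob("img-*.png")):
        result = subprocess.run(
            ocr_command(image), check=True, capture_output=True, text=True
        )
        texts[image.name] = result.stdout
    return texts
```

If `ocr_embedded_images("sample.pdf", "extracted")` returns readable text for a given PDF, Tesseract itself can handle those images, and the gap is more likely in how embedded images are passed to OCR during indexing.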

Maybe share your PDF here, or in the FSCrawler issue, or even better in the Tika project?

I am trying to upload the PDF here, but it gives me the following error:

Sorry, the file you are trying to upload is not authorized (authorized extensions: jpg, jpeg, png, gif).