Read image text from pdf

Nikparab · April 14, 2017, 6:13am

Hello team,
I'm using FSCrawler 2.3-SNAPSHOT. With it I can load PDFs into Elasticsearch, but I can't read the text of the images inside them. Is there any way to read image text?
If anyone knows, please reply.

dadoonet · April 14, 2017, 7:17am

Have a look at OCR: https://github.com/dadoonet/fscrawler/blob/master/README.md#tips-and-tricks

That said there is a bug report: https://github.com/dadoonet/fscrawler/issues/314

Nikparab · April 14, 2017, 7:28am

OK, thanks @dadoonet.
I will try it.

dadoonet · April 14, 2017, 7:43am

I'll be happy to have your feedback

Nikparab · April 14, 2017, 8:00am

Sure, @dadoonet.
I'm reading the content of the link you sent, but I can't find the exact point I need and I don't fully understand it. I need to read it again.
If I find anything helpful, I will let you know.
Thanks

Nikparab · April 14, 2017, 10:42am

Sorry @dadoonet,
I'm not finding what I need there.
Actually, I don't want to write any code.
But thanks once again for your quick reply.
Is there any other tool like FSCrawler that can also parse images?
If yes, can you please tell me?
Frankly speaking, I'm quite weak in programming.

dadoonet · April 14, 2017, 11:03am

FSCrawler does not require any programming.

Nikparab · April 14, 2017, 11:24am

OK. But how can I use Tesseract OCR with FSCrawler?
Can you please tell me?

Nikparab · April 14, 2017, 11:57am

Also, Tesseract OCR is not working on 64-bit.
Is there any other option?

dadoonet · April 14, 2017, 12:47pm

I don't believe Tika has any other option.

Nikparab · April 14, 2017, 1:16pm

OK.
Thanks @dadoonet for the quick reply.
If you find anything about parsing images, please let me know.
Until then I will try Tesseract OCR.
Thanks once again.

Nikparab · April 15, 2017, 10:12am

Hello @dadoonet

I installed tesseract-alpha for Windows, as per the following:
"To deal with images containing text, just install Tesseract. Tesseract will be auto-detected by Tika. Then add an image (png, jpg, ...) into your Fscrawler root directory. After the next index update, the text will be indexed and placed in "_source.content"."

I added images to the FSCrawler root directory and ran FSCrawler in cmd.
It gives the following output:

      {
        "_index": "photo",
        "_type": "image",
        "_id": "54d256ed121e93f6946f8e177634ff0",
        "_score": 1,
        "_source": {
          "meta": {},
          "file": {
            "extension": "jpg",
            "content_type": "image/jpeg",
            "last_modified": "2016-12-20T10:37:22.125",
            "indexing_date": "2017-04-15T15:34:16.65",
            "filesize": 12952,
            "filename": "bigdata.jpg",
            "url": """file://C:\tmp\image\bigdata.jpg"""
          },
          "path": {
            "encoded": "45f07b74406231761c074a1189bc9aa",
            "root": "45f07b74406231761c074a1189bc9aa",
            "virtual": "/",
            "real": """C:\tmp\image\bigdata.jpg"""
          }
        }
      }
    ]
  }
}

It does not return the text in the image.
What can I do about this?
Can you please tell me?

I'm referring to the following link
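
A quick way to confirm whether OCR produced anything is to look for a "content" key under "_source" in each hit, since that is where the quoted instructions say the text is placed. The following is a minimal Python sketch under that assumption; the helper name extracted_text is hypothetical, and the sample hit is a trimmed copy of the one posted above.

```python
def extracted_text(hit):
    """Return the text stored under _source.content, or None if absent or blank."""
    content = hit.get("_source", {}).get("content")
    if isinstance(content, str) and content.strip():
        return content
    return None


# Trimmed copy of the hit posted above: there is no "content" key under _source.
hit = {
    "_index": "photo",
    "_type": "image",
    "_source": {
        "meta": {},
        "file": {"filename": "bigdata.jpg", "content_type": "image/jpeg"},
    },
}

print(extracted_text(hit))  # None, so no text was extracted for this image
```

A hit with no "content" field, as here, suggests that OCR did not run or returned nothing for that file.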

dadoonet · April 15, 2017, 10:32am

Thanks for testing.
Probably the same issue I previously linked to. You can add your comment on the issue.

I'll try to fix it in the coming months, if it is fixable.

Nikparab · April 15, 2017, 10:35am

Thanks @dadoonet for the quick reply.
Where can I add my comment?
Can you please provide the link?

dadoonet · April 15, 2017, 10:49am

Nikparab · April 15, 2017, 10:49am

Thanks @dadoonet.

Nikparab · April 17, 2017, 9:31am

I tried tesseract-ocr in cmd and it is able to read text from the image.
Now, how can FSCrawler detect tesseract-ocr to read image text?
Is there any configuration required?

Please reply.
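
The quoted instructions say Tesseract is auto-detected by Tika. As far as I know, that detection relies on the tesseract executable being reachable through the PATH of the process that launches FSCrawler, so one thing worth checking is whether the same shell can find it. The following is a small Python sketch of that check; the function names are hypothetical, and it only inspects the local environment, not FSCrawler or Tika themselves.

```python
import shutil
import subprocess


def find_tesseract(binary="tesseract"):
    """Return the full path of the Tesseract executable found on PATH, or None."""
    return shutil.which(binary)


def tesseract_version(path):
    """Run `<path> --version` and return the first line of its output."""
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Some older Tesseract builds print the version banner to stderr.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract was not found on PATH")
    else:
        print(path, "->", tesseract_version(path))
```

If this reports that tesseract is missing while it works from another prompt, the installation directory is probably not on the PATH seen by the shell that starts FSCrawler.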

Nikparab · April 19, 2017, 7:33am

Hello @dadoonet
I switched my OS to CentOS 7 to read image text.
I installed "tesseract-3.04.00-3.el7.x86_64.rpm" on CentOS and ran FSCrawler.
I added only images that contain text, and now I can successfully read the text of plain images.
But I am not able to read the text of images that are embedded in a PDF after indexing the PDF in Elasticsearch.

Is there any solution for that?
Please reply.
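
To check whether the images inside a PDF are readable by Tesseract at all, independently of FSCrawler and Tika, one option is to extract them and run OCR on each one by hand. The sketch below is a workaround under stated assumptions, not something FSCrawler does: it assumes the pdfimages tool from poppler-utils and the tesseract command are both installed and on PATH, and the function names are hypothetical.

```python
import glob
import os
import subprocess
import tempfile


def pdfimages_command(pdf_path, prefix):
    """Command that writes each embedded image as <prefix>-NNN.png."""
    return ["pdfimages", "-png", pdf_path, prefix]


def tesseract_command(image_path):
    """Command that runs OCR on one image and prints the text to stdout."""
    return ["tesseract", image_path, "stdout"]


def ocr_pdf_images(pdf_path):
    """Return the OCR text of every image embedded in the PDF, in extraction order."""
    texts = []
    with tempfile.TemporaryDirectory() as workdir:
        prefix = os.path.join(workdir, "img")
        # Extract the embedded images into the temporary directory.
        subprocess.run(pdfimages_command(pdf_path, prefix), check=True)
        for image in sorted(glob.glob(prefix + "-*.png")):
            result = subprocess.run(
                tesseract_command(image),
                check=True,
                capture_output=True,
                text=True,
            )
            texts.append(result.stdout)
    return texts
```

If ocr_pdf_images returns the expected text for a given PDF, the images themselves are fine and the gap is in how embedded images are handled during indexing.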

dadoonet · April 19, 2017, 8:06am

Maybe share your PDF here, or in the FSCrawler issue, or even better in the Tika project?

Nikparab · April 19, 2017, 8:13am

I am trying to upload the PDF here, but it gives me the following error:

Sorry, the file you are trying to upload is not authorized (authorized extensions: jpg, jpeg, png, gif).
