NitzaAg
(Nitza Ag)
February 29, 2024, 9:15pm
1
I'm trying to extract text from a PDF, but I get the following error:
Unable to extract PDF content -> Unable to end a page -> I regret that I couldn't find an OCR parser to handle image/ocr-png.Please set the OCR_STRATEGY to NO_OCR or configure yourOCR parser correctly
My configuration looks like this:
---
name: "ocr_docs"
fs:
  url: "C:\\Users\\Documents\\docs2ocr"
  update_rate: "15m"
  excludes:
  - "*\\~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: true
  raw_metadata: true
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
    path: "C:\\Program Files\\https%3a%2f%2fmirrors.163.com%2fcygwin%2f\\x86_64\\release\\tesseract-ocr\\tesseract-ocr-5.3.3-1\\usr\\bin"
    data_path: "C:\\Program Files\\https%3a%2f%2fmirrors.163.com%2fcygwin%2f\\x86_64\\release\\tesseract-ocr\\tesseract-ocr-5.3.3-1\\usr\\share\\tessdata"
  follow_symlinks: false
elasticsearch:
  nodes:
  - cloud_id: "cloud_id"
  username: "username"
  password: "password"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
I'm using fscrawler-distribution-2.10-20231023.160816-291
dadoonet
(David Pilato)
February 29, 2024, 11:10pm
2
Welcome!
I think you should find a more recent build than "20231023".
The latest I can see is fscrawler-distribution-2.10-20240213.145447-315.zip.
But I don't think that will fix your problem. Could you share your PDF document so I can try it out? If you can't share it publicly, you can send me a private message.
NitzaAg
(Nitza Ag)
March 1, 2024, 4:43pm
3
Thank you! I sent you a DM with the PDF document. Please let me know if you need more information.
dadoonet
(David Pilato)
March 3, 2024, 12:33pm
4
So I tried your document and was able to get its content. Something like *** 18 de septiembre de 2013 ***
(skipping the rest)...
Maybe this does not work on Windows, or it was caused by an old build of FSCrawler?
Could you try with the most recent build?
If this does not work, could you try with the FSCrawler Docker images?
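Another thing that might be worth checking on Windows is whether the directories given in fs.ocr.path and fs.ocr.data_path really contain a Tesseract executable and the trained data for the configured language. Here is a minimal Python sketch of such a check. The function name check_tesseract_paths is made up for illustration and is not part of FSCrawler or Tika, and the file names it looks for (tesseract.exe or tesseract, and <language>.traineddata) are the usual Tesseract naming conventions. It only tests that the files exist; it does not prove that the OCR parser can actually launch the binary.

```python
from pathlib import Path
from typing import List


def check_tesseract_paths(bin_dir: str, data_dir: str, language: str = "eng") -> List[str]:
    """Return a list of problems found; an empty list means the paths look plausible.

    Hypothetical helper for illustration only, not an FSCrawler or Tika API.
    """
    problems = []
    bin_path = Path(bin_dir)
    data_path = Path(data_dir)

    # The executable is normally tesseract.exe on Windows and tesseract elsewhere.
    if not any((bin_path / name).is_file() for name in ("tesseract.exe", "tesseract")):
        problems.append(f"no tesseract executable found in {bin_path}")

    # Each OCR language needs a matching <language>.traineddata file.
    if not (data_path / f"{language}.traineddata").is_file():
        problems.append(f"missing {language}.traineddata in {data_path}")

    return problems
```

You would call it with the same values as in your job settings, for example check_tesseract_paths(path, data_path, "eng"), and print whatever it returns. If it reports a missing file, the path setting is the first thing to fix.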
NitzaAg
(Nitza Ag)
March 12, 2024, 10:41pm
5
I tried that build and it didn't work. However, it worked after switching to the FSCrawler Docker images. Thanks for the help!
system
(system)
Closed
April 14, 2024, 9:17pm
8
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.