I regret that I couldn't find an OCR parser to handle image/ocr-png.Please set the OCR_STRATEGY to NO_OCR or configure yourOCR parser correctly

Hi Everyone,

I'm trying to use OCR when indexing but not getting any results.

I've tried using Tesseract for images and Tika for the PDFs, but not a single file gets indexed with a "content" field.

The error message is:

09:42:43,441 INFO  [f.p.e.c.f.t.TikaInstance] OCR is enabled. This might slowdown the process.
09:42:44,195 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\OCRTESTING\2A32-WHB-001.pdf]: Unable to extract PDF content -> Unable to end a page -> I regret that I couldn't find an OCR parser to handle image/ocr-png.Please set the OCR_STRATEGY to NO_OCR or configure yourOCR parser correctly
09:42:45,413 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\OCRTESTING\616318-P79810-0023 - Red Marked.pdf]: Unable to extract PDF content -> Unable to end a page -> I regret that I couldn't find an OCR parser to handle image/ocr-png.Please set the OCR_STRATEGY to NO_OCR or configure yourOCR parser correctly
09:42:45,587 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\OCRTESTING\616318-P79810-0031 - Red Marked.pdf]: Unable to extract PDF content -> Unable to end a page -> I regret that I couldn't find an OCR parser to handle image/ocr-png.Please set the OCR_STRATEGY to NO_OCR or configure yourOCR parser correctly
09:42:45,657 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\OCRTESTING\616318-P79810-0037.pdf]: Unable to extract PDF content -> Unable to end a page -> I regret that I couldn't find an OCR parser to handle image/ocr-png.Please set the OCR_STRATEGY to NO_OCR or configure yourOCR parser correctly
09:42:45,715 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\OCRTESTING\616318-P79810-0039.pdf]: Unable to extract PDF content -> Unable to end a page -> I regret that I couldn't find an OCR parser to handle image/ocr-png.Please set the OCR_STRATEGY to NO_OCR or configure yourOCR parser correctly
09:42:45,775 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\OCRTESTING\616318-P79810-0043.pdf]: Unable to extract PDF content -> Unable to end a page -> I regret that I couldn't find an OCR parser to handle image/ocr-png.Please set the OCR_STRATEGY to NO_OCR or configure yourOCR parser correctly
09:42:45,797 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\OCRTESTING\Exception list - HP STEAM LINE - .pdf]: Unable to extract PDF content -> Unable to end a page -> I regret that I couldn't find an OCR parser to handle image/ocr-png.Please set the OCR_STRATEGY to NO_OCR or configure yourOCR parser correctly
09:42:45,838 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\OCRTESTING\MD-512-TE-2015.pdf]: Unable to extract PDF content -> Unable to end a page -> I regret that I couldn't find an OCR parser to handle image/ocr-png.Please set the OCR_STRATEGY to NO_OCR or configure yourOCR parser correctly

And my setup file looks like:

---
name: "ocr_testing"
fs:
  url: "D:\\OCRTESTING"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
    path: "D:/tesseract/"
    data_path: "D:/tesseract/tessdata"
    output_type: "txt"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
  ssl_verification: true

I've installed the ingest attachment plugin.

Everything else is standard "out of the box".

I'm currently trying to scrounge up more types of scanned docs to test with as this is my primary use case.

I will also mention that Adobe has no issues running OCR on any of the files tested so far, but the saved files still show up with nothing in "content".

I do realize the docs say "_source_content" but that doesn't show up either.

Any insight would be greatly appreciated.

Chris

Hey Chris.

Not related to your question, but the ingest attachment plugin is not needed when you are using FSCrawler.

I think that you might need to change:

path: "D:/tesseract/"
data_path: "D:/tesseract/tessdata"

To:

path: "D:\\tesseract"
data_path: "D:\\tesseract\\tessdata"

Could you try?

Also run fscrawler with --debug so I can see more information.
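
Before re-running, it may be worth confirming that the two directories from the settings actually contain what the OCR parser needs. The following is only a rough sketch in Python, not part of FSCrawler: it assumes that path should hold the tesseract executable and that data_path should hold a <language>.traineddata file. The function names are made up for illustration.

```python
import os
import shutil
import subprocess


def find_tesseract(ocr_path):
    """Return the full path of a tesseract executable inside ocr_path, or None."""
    # With an explicit `path`, shutil.which searches that directory instead of
    # the system PATH; on Windows it also tries extensions such as .exe.
    return shutil.which("tesseract", path=ocr_path)


def has_language(data_path, language):
    """Return True if <data_path>/<language>.traineddata exists."""
    return os.path.isfile(os.path.join(data_path, language + ".traineddata"))


def check_ocr_setup(ocr_path, data_path, language):
    """Collect readable problems; an empty list means the basics look fine."""
    problems = []
    exe = find_tesseract(ocr_path)
    if exe is None:
        problems.append("no tesseract executable found in " + ocr_path)
    else:
        try:
            # `tesseract --version` exits with 0 when the binary can start.
            result = subprocess.run([exe, "--version"], capture_output=True, text=True)
            if result.returncode != 0:
                problems.append("tesseract did not start: " + result.stderr.strip())
        except OSError as err:
            problems.append("tesseract could not be launched: " + str(err))
    if not has_language(data_path, language):
        problems.append("missing " + language + ".traineddata in " + data_path)
    return problems
```

With the values from this thread, the call would be check_ocr_setup("D:\\tesseract", "D:\\tesseract\\tessdata", "eng"). An empty list only says the files are in place, not that FSCrawler will pick them up.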

Hi David, to the rescue again... I appreciate it.

11:54:52,219 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
11:54:52,219 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
11:54:52,815 WARN  [o.e.c.RestClient] request [GET http://127.0.0.1:9200/] returned 1 warnings: [299 Elasticsearch-7.17.1-e5acb99f822233d62d6444ce45a4543dc1c8059a "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security."]
11:54:52,836 WARN  [o.e.c.RestClient] request [GET http://127.0.0.1:9200/] returned 1 warnings: [299 Elasticsearch-7.17.1-e5acb99f822233d62d6444ce45a4543dc1c8059a "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security."]
11:54:52,839 WARN  [o.e.c.RestClient] request [GET http://127.0.0.1:9200/] returned 1 warnings: [299 Elasticsearch-7.17.1-e5acb99f822233d62d6444ce45a4543dc1c8059a "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security."]
11:54:52,840 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.17.1
11:54:53,009 WARN  [o.e.c.RestClient] request [GET http://127.0.0.1:9200/] returned 1 warnings: [299 Elasticsearch-7.17.1-e5acb99f822233d62d6444ce45a4543dc1c8059a "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security."]
11:54:53,013 WARN  [o.e.c.RestClient] request [GET http://127.0.0.1:9200/] returned 1 warnings: [299 Elasticsearch-7.17.1-e5acb99f822233d62d6444ce45a4543dc1c8059a "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security."]
11:54:53,015 WARN  [o.e.c.RestClient] request [GET http://127.0.0.1:9200/] returned 1 warnings: [299 Elasticsearch-7.17.1-e5acb99f822233d62d6444ce45a4543dc1c8059a "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security."]
11:54:53,016 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.17.1
11:54:53,019 WARN  [o.e.c.RestClient] request [GET http://127.0.0.1:9200/] returned 1 warnings: [299 Elasticsearch-7.17.1-e5acb99f822233d62d6444ce45a4543dc1c8059a "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security."]
11:54:53,330 WARN  [o.e.c.RestClient] request [PUT http://127.0.0.1:9200/ocr_testing?master_timeout=30s&timeout=30s] returned 2 warnings: [299 Elasticsearch-7.17.1-e5acb99f822233d62d6444ce45a4543dc1c8059a "Camel case format name dateOptionalTime is deprecated and will be removed in a future version. Use snake case name date_optional_time instead."],[299 Elasticsearch-7.17.1-e5acb99f822233d62d6444ce45a4543dc1c8059a "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security."]
11:54:53,338 WARN  [o.e.c.RestClient] request [GET http://127.0.0.1:9200/_cluster/health/ocr_testing?master_timeout=30s&level=cluster&timeout=30s&wait_for_status=yellow] returned 1 warnings: [299 Elasticsearch-7.17.1-e5acb99f822233d62d6444ce45a4543dc1c8059a "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security."]
11:54:53,531 WARN  [o.e.c.RestClient] request [PUT http://127.0.0.1:9200/ocr_testing_folder?master_timeout=30s&timeout=30s] returned 1 warnings: [299 Elasticsearch-7.17.1-e5acb99f822233d62d6444ce45a4543dc1c8059a "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security."]
11:54:53,536 WARN  [o.e.c.RestClient] request [GET http://127.0.0.1:9200/_cluster/health/ocr_testing_folder?master_timeout=30s&level=cluster&timeout=30s&wait_for_status=yellow] returned 1 warnings: [299 Elasticsearch-7.17.1-e5acb99f822233d62d6444ce45a4543dc1c8059a "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security."]
11:54:53,541 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [ocr_testing] for [D:\OCRTESTING] every [1m]
11:54:53,543 WARN  [o.e.c.RestClient] request [GET http://127.0.0.1:9200/] returned 1 warnings: [299 Elasticsearch-7.17.1-e5acb99f822233d62d6444ce45a4543dc1c8059a "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security."]
11:54:53,668 INFO  [f.p.e.c.f.t.TikaInstance] OCR is enabled. This might slowdown the process.
11:54:54,365 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\OCRTESTING\2A32-WHB-001.pdf]: Unable to extract PDF content -> Unable to end a page -> I regret that I couldn't find an OCR parser to handle image/ocr-png.Please set the OCR_STRATEGY to NO_OCR or configure yourOCR parser correctly
11:54:55,525 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\OCRTESTING\616318-P79810-0023 - Red Marked.pdf]: Unable to extract PDF content -> Unable to end a page -> I regret that I couldn't find an OCR parser to handle image/ocr-png.Please set the OCR_STRATEGY to NO_OCR or configure yourOCR parser correctly
11:54:55,695 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\OCRTESTING\616318-P79810-0031 - Red Marked.pdf]: Unable to extract PDF content -> Unable to end a page -> I regret that I couldn't find an OCR parser to handle image/ocr-png.Please set the OCR_STRATEGY to NO_OCR or configure yourOCR parser correctly
11:54:55,770 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\OCRTESTING\616318-P79810-0037.pdf]: Unable to extract PDF content -> Unable to end a page -> I regret that I couldn't find an OCR parser to handle image/ocr-png.Please set the OCR_STRATEGY to NO_OCR or configure yourOCR parser correctly
11:54:55,824 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\OCRTESTING\616318-P79810-0039.pdf]: Unable to extract PDF content -> Unable to end a page -> I regret that I couldn't find an OCR parser to handle image/ocr-png.Please set the OCR_STRATEGY to NO_OCR or configure yourOCR parser correctly
11:54:55,879 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\OCRTESTING\616318-P79810-0043.pdf]: Unable to extract PDF content -> Unable to end a page -> I regret that I couldn't find an OCR parser to handle image/ocr-png.Please set the OCR_STRATEGY to NO_OCR or configure yourOCR parser correctly
11:54:55,893 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\OCRTESTING\Exception list - HP STEAM LINE - .pdf]: Unable to extract PDF content -> Unable to end a page -> I regret that I couldn't find an OCR parser to handle image/ocr-png.Please set the OCR_STRATEGY to NO_OCR or configure yourOCR parser correctly
11:54:55,999 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\OCRTESTING\MD-512-TE-2015.pdf]: Unable to extract PDF content -> Unable to end a page -> I regret that I couldn't find an OCR parser to handle image/ocr-png.Please set the OCR_STRATEGY to NO_OCR or configure yourOCR parser correctly
11:54:56,034 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [D:\OCRTESTING\Office zoom in.pdf]: Unable to extract PDF content -> Unable to end a page -> I regret that I couldn't find an OCR parser to handle image/ocr-png.Please set the OCR_STRATEGY to NO_OCR or configure yourOCR parser correctly
11:54:56,074 WARN  [o.e.c.RestClient] request [POST http://127.0.0.1:9200/ocr_testing/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=true&expand_wildcards=open&allow_no_indices=true&ignore_throttled=false&search_type=query_then_fetch&batched_reduce_size=512] returned 2 warnings: [299 Elasticsearch-7.17.1-e5acb99f822233d62d6444ce45a4543dc1c8059a "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security."],[299 Elasticsearch-7.17.1-e5acb99f822233d62d6444ce45a4543dc1c8059a "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
11:54:56,089 WARN  [o.e.c.RestClient] request [POST http://127.0.0.1:9200/ocr_testing_folder/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=true&expand_wildcards=open&allow_no_indices=true&ignore_throttled=false&search_type=query_then_fetch&batched_reduce_size=512] returned 2 warnings: [299 Elasticsearch-7.17.1-e5acb99f822233d62d6444ce45a4543dc1c8059a "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security."],[299 Elasticsearch-7.17.1-e5acb99f822233d62d6444ce45a4543dc1c8059a "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
11:54:57,994 WARN  [o.e.c.RestClient] request [POST http://127.0.0.1:9200/_bulk?timeout=1m] returned 1 warnings: [299 Elasticsearch-7.17.1-e5acb99f822233d62d6444ce45a4543dc1c8059a "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security."]
11:54:58,040 WARN  [o.e.c.RestClient] request [POST http://127.0.0.1:9200/_bulk?timeout=1m] returned 1 warnings: [299 Elasticsearch-7.17.1-e5acb99f822233d62d6444ce45a4543dc1c8059a "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security."]
11:55:56,131 WARN  [o.e.c.RestClient] request [POST http://127.0.0.1:9200/ocr_testing/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=true&expand_wildcards=open&allow_no_indices=true&ignore_throttled=false&search_type=query_then_fetch&batched_reduce_size=512] returned 2 warnings: [299 Elasticsearch-7.17.1-e5acb99f822233d62d6444ce45a4543dc1c8059a "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security."],[299 Elasticsearch-7.17.1-e5acb99f822233d62d6444ce45a4543dc1c8059a "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
11:55:56,145 WARN  [o.e.c.RestClient] request [POST http://127.0.0.1:9200/ocr_testing_folder/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=true&expand_wildcards=open&allow_no_indices=true&ignore_throttled=false&search_type=query_then_fetch&batched_reduce_size=512] returned 2 warnings: [299 Elasticsearch-7.17.1-e5acb99f822233d62d6444ce45a4543dc1c8059a "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security."],[299 Elasticsearch-7.17.1-e5acb99f822233d62d6444ce45a4543dc1c8059a "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
11:56:56,177 WARN  [o.e.c.RestClient] request [POST http://127.0.0.1:9200/ocr_testing/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=true&expand_wildcards=open&allow_no_indices=true&ignore_throttled=false&search_type=query_then_fetch&batched_reduce_size=512] returned 2 warnings: [299 Elasticsearch-7.17.1-e5acb99f822233d62d6444ce45a4543dc1c8059a "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security."],[299 Elasticsearch-7.17.1-e5acb99f822233d62d6444ce45a4543dc1c8059a "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
11:56:56,187 WARN  [o.e.c.RestClient] request [POST http://127.0.0.1:9200/ocr_testing_folder/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=true&expand_wildcards=open&allow_no_indices=true&ignore_throttled=false&search_type=query_then_fetch&batched_reduce_size=512] returned 2 warnings: [299 Elasticsearch-7.17.1-e5acb99f822233d62d6444ce45a4543dc1c8059a "Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security."],[299 Elasticsearch-7.17.1-e5acb99f822233d62d6444ce45a4543dc1c8059a "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]

_settings.yml

---
name: "ocr_testing"
fs:
  url: "D:\\OCRTESTING"
  update_rate: "1m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
    path: "D:\\tesseract"
    data_path: "D:\\tesseract\\tessdata"
    output_type: "txt"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
  ssl_verification: true

There doesn't seem to be any extra information when running .\bin\fscrawler ocr_testing --debug

I can't find any log file either: config/log4j2.xml

_status.json also shows nothing indexed, and it's the same for the other indices. I'm not sure if that's an issue.

{
  "name" : "ocr_testing",
  "lastrun" : "2022-03-12T12:05:54.4408395",
  "indexed" : 0,
  "deleted" : 0
}
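
To see what actually reached the index, independently of _status.json, one option is to ask Elasticsearch for documents that have no content field. This is a minimal sketch using only the Python standard library. It assumes the node URL and index name from the settings above and that FSCrawler stores the file name under file.filename; the function names are hypothetical.

```python
import json
import urllib.request


def missing_content_query(size=10):
    """Build a search body matching documents that lack a `content` field."""
    return {
        "size": size,
        "_source": ["file.filename"],
        "query": {"bool": {"must_not": [{"exists": {"field": "content"}}]}},
    }


def search_request(base_url, index, body):
    """Prepare (but do not send) a POST request to <base_url>/<index>/_search."""
    url = base_url.rstrip("/") + "/" + index + "/_search"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def files_without_content(base_url, index):
    """Send the search and return the file names of the matching documents."""
    req = search_request(base_url, index, missing_content_query())
    with urllib.request.urlopen(req) as resp:
        hits = json.load(resp)["hits"]["hits"]
    return [hit["_source"]["file"]["filename"] for hit in hits]
```

Calling files_without_content("http://127.0.0.1:9200", "ocr_testing") against a running node should list the files that were indexed without extracted text, if the field names assumed here match the mapping.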

I threw a PNG in there to kick it off, but it didn't work, so I deleted the index and recreated it using FSCrawler.

I've also tried restarting Elasticsearch. Could it be something unrelated to this, like a PC setting? Something installed or not installed? Environment settings? Java versions, Python versions?

Which FSCrawler version are you using?
Could you share the documents you are trying to index?

Hi David,

Elasticsearch 7.17.1
Kibana 7.17.1
fscrawler es7-2.9

Sure, privately no problem.

Regards,

Ok great. Yes please, DM me.

About the debug/trace issue, I think I solved it in the latest 2.10-SNAPSHOT.

For now, you can define a user environment variable FS_JAVA_OPTS set to -DLOG_LEVEL=debug.
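
If changing the system-wide settings is inconvenient, FS_JAVA_OPTS can also be set just for one run. Below is a small Python sketch of that idea. It assumes FS_JAVA_OPTS is read from the process environment by the launcher script; the launcher path in the usage note is a guess at a default Windows install, and the function names are hypothetical.

```python
import os
import subprocess


def debug_env(base_env=None):
    """Return a copy of the environment with FS_JAVA_OPTS set for debug logging."""
    env = dict(os.environ if base_env is None else base_env)
    env["FS_JAVA_OPTS"] = "-DLOG_LEVEL=debug"
    return env


def run_fscrawler(launcher, job_name):
    """Start FSCrawler for job_name with debug logging; blocks until it exits."""
    # The variable is passed only to this child process, so the user's
    # permanent environment stays untouched.
    return subprocess.run([launcher, job_name], env=debug_env()).returncode
```

For the job in this thread, that could look like run_fscrawler("bin\\fscrawler.bat", "ocr_testing"), run from the FSCrawler install directory.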

Hi David, I couldn't attach the PDFs, so I've taken the liberty of sending them to you by email at david@pilato.fr.

Hope that's ok.

Regards,

Chris

Hi David,

I'm sorry,

Everything is working now.

I rebooted this morning and added the Harvard Dataset.

It's even working on the old files. The only messages coming through in PowerShell are the security warnings.

Appreciate you, thanks.

Great news!
