Could not see OCR text in "content" field

I am using FSCrawler with the settings below:

---
name: "test2"
fs:
  url: "/home/local/es"
  update_rate: "2m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: true
  continue_on_error: false
  ocr:
    enabled: true
    language: "eng"
    path: "/home/local/bin"
    data_path: "/home/local/share/tessdata"
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"

I started it with:

./fscrawler --restart --trace

There is no error from OCR, but I still cannot see the OCR text for images, PDFs, or TIFFs.

13:30:54,074 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] wait for yellow health on index [test2_folder]
13:30:54,077 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] health response: {"cluster_name":"elasticsearch","status":"yellow","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":1,"active_shards":1,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":1,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":61.111111111111114}
13:30:54,080 DEBUG [f.p.e.c.f.FsParserAbstract] creating fs crawler thread [test2] for [/home/local/es] every [2m]
13:30:54,080 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [test2] for [/home/local/es] every [2m]
13:30:54,080 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler thread [test2] is now running. Run #1...
13:30:54,088 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/home/local/es, /home/local/es) = /
13:30:54,091 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing test2_folder/8a4fb678d07a821dbc3abeb4aeb337a?pipeline=null
13:30:54,091 TRACE [f.p.e.c.f.FsParserAbstract] JSon indexed : {
  "root" : "bd7d7b2d64521cab5e6db3ff54a0711c",
  "virtual" : "/",
  "real" : "/home/local/es"
}
13:30:54,094 DEBUG [f.p.e.c.f.FsParserAbstract] indexing [/home/local/es] content
13:30:54,094 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from /home/local/es
13:30:54,099 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 3 local files found
13:30:54,100 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstractModel{name='test-ocr.pdf', file=true, directory=false, lastModifiedDate=2020-06-25T12:13:16.999936, creationDate=2020-06-25T12:13:16.999936, accessDate=2020-06-25T12:13:16.999936, path='/home/local/es', owner='', group='', permissions=777, extension='pdf', fullpath='/home/local/es/test-ocr.pdf', size=112983}
13:30:54,100 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/home/local/es, /home/local/es/test-ocr.pdf) = /test-ocr.pdf
13:30:54,100 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/test-ocr.pdf], includes = [null], excludes = [[*/~*]]
13:30:54,100 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/test-ocr.pdf], excludes = [[*/~*]]
13:30:54,100 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
13:30:54,101 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
13:30:54,101 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/test-ocr.pdf], includes = [null]
13:30:54,101 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
13:30:54,101 DEBUG [f.p.e.c.f.FsParserAbstract] [/test-ocr.pdf] can be indexed: [true]
13:30:54,101 DEBUG [f.p.e.c.f.FsParserAbstract]   - file: /test-ocr.pdf
13:30:54,101 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [/home/local/es],[test-ocr.pdf]
13:30:54,104 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/home/local/es, /home/local/es/test-ocr.pdf) = /test-ocr.pdf
13:30:54,105 TRACE [f.p.e.c.f.t.TikaDocParser] Generating document [/home/local/es/test-ocr.pdf]
13:30:54,113 TRACE [f.p.e.c.f.t.TikaDocParser] Beginning Tika extraction
13:30:54,128 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.
13:30:54,164 DEBUG [f.p.e.c.f.t.TikaInstance] OCR strategy for PDF documents is [ocr_and_text] and tesseract was found.
13:30:54,472 WARN  [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

13:30:54,755 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated so we need to configure Tesseract in case we have specific settings.
13:30:54,756 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Path set to [/home/local/bin].
13:30:54,756 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Data Path set to [/home/local/share/tessdata].
13:30:54,756 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [eng].
13:30:56,277 TRACE [f.p.e.c.f.t.TikaDocParser] End of Tika extraction
13:30:56,285 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
13:30:56,285 TRACE [f.p.e.c.f.f.FsCrawlerUtil] No pattern always matches.
13:30:56,298 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing test2/2e9faf9314bd6ea61ede8b7559f253b?pipeline=null
13:30:56,298 TRACE [f.p.e.c.f.FsParserAbstract] JSon indexed : {
  "content" : "\n \n\n \n\nThis file also contains text. \n\n  \n\n\n\nThis second part of the text is in Page 2 \n\n \n\n\n",
  "meta" : {
    "author" : "David Pilato",
    "title" : "Test Tika title",
    "date" : "2019-03-02T11:42:36.000+00:00",
    "keywords" : [ "keyword1", " keyword2" ],
    "language" : "en-US",
    "format" : "application/pdf; version=1.7",
    "creator_tool" : "Microsoft Word",
    "description" : "Test Tika Object",
    "created" : "2019-03-02T11:42:36.000+00:00"
  },
  "file" : {
    "extension" : "pdf",
    "content_type" : "application/pdf",
    "created" : "2020-06-25T10:13:16.999+00:00",
    "last_modified" : "2020-06-25T10:13:16.999+00:00",
    "last_accessed" : "2020-06-25T10:13:16.999+00:00",
    "indexing_date" : "2020-06-25T11:30:54.103+00:00",
    "filesize" : 112983,
    "filename" : "test-ocr.pdf",
    "url" : "file:///home/local/es/test-ocr.pdf"
  },
  "path" : {
    "root" : "8a4fb678d07a821dbc3abeb4aeb337a",
    "virtual" : "/test-ocr.pdf",
    "real" : "/home/local/es/test-ocr.pdf"
  }
}
13:30:56,298 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstractModel{name='index.png', file=true, directory=false, lastModifiedDate=2020-06-25T12:31:29.664488, creationDate=2020-06-25T12:31:29.664488, accessDate=2020-06-25T12:31:29.664488, path='/home/local/es', owner='', 

Please format your code, logs or configuration files using </> icon as explained in this guide and not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

There's a live preview panel for exactly this reason.

Lots of people read these forums, and many of them will simply skip over a post that is difficult to read, because it's just too large an investment of their time to try and follow a wall of badly formatted text.
If your goal is to get an answer to your questions, it's in your interest to make it as easy to read and understand as possible.
Please update your post.

Thanks for your reply. I have fixed the formatting. Could you please check and get back to me?

Indeed. Some text is missing.

Are you able to run /home/local/bin/tesseract?

Yes, Tesseract runs, but it does not recognize the text. The value comes back as "\n".
My tesseract executable is in /home/local/bin.

Could you share the output of:

ls -l /home/local/bin/tesseract

Screenshot from 2020-06-27 05-39-12

That's weird.

What is the version of tesseract?

I tried both 4.1.1 and 3.02.02, but neither works.

That's weird. I honestly don't know. I would need to try on a proper installation. Maybe it's related to this J2KImageReader warning message. I have an open issue about that one.

I am sending you a screenshot of the tessdata directory. Could you check whether it looks complete?
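For readers who cannot compare against a screenshot, here is a minimal sketch of the same check in Python. It relies on the fact that Tesseract loads each language from a file named `<lang>.traineddata` inside the data directory, and that several languages can be combined with `+`. The function names `missing_language_files` and `split_languages` are made up for this illustration and are not part of FSCrawler or Tesseract.

```python
from pathlib import Path


def split_languages(language_setting):
    """Split a Tesseract language setting such as "eng+fra" into ["eng", "fra"]."""
    return [part for part in language_setting.split("+") if part]


def missing_language_files(data_path, languages):
    """Return the languages whose <lang>.traineddata file is absent from data_path."""
    data_dir = Path(data_path)
    return [
        lang for lang in languages
        if not (data_dir / f"{lang}.traineddata").is_file()
    ]


if __name__ == "__main__":
    # Values copied from the settings posted at the top of this thread.
    missing = missing_language_files(
        "/home/local/share/tessdata", split_languages("eng")
    )
    print("missing language files:", missing or "none")
```

An empty result only shows that the language file is present. It says nothing about whether the Tesseract binary itself can read image files.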

I don't know. Best thing to do would be to make an image like a PNG and send it on the command line to tesseract to check that it runs correctly.
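The command-line test suggested here can be sketched in Python as follows. It wraps the standard invocation `tesseract <image> stdout -l <language>`, which prints the recognized text to standard output. The helper names `build_ocr_command` and `ocr_image` and the file name `sample.png` are placeholders chosen for this example. The binary path is the one from this thread.

```python
import shutil
import subprocess


def build_ocr_command(image_path, language="eng", binary="tesseract"):
    """Build the argument list for: tesseract <image> stdout -l <language>."""
    return [binary, str(image_path), "stdout", "-l", language]


def ocr_image(image_path, language="eng", binary="tesseract"):
    """Run Tesseract on one image and return the recognized text.

    Raises FileNotFoundError if the binary cannot be found and
    subprocess.CalledProcessError if Tesseract exits with an error.
    """
    resolved = shutil.which(binary)
    if resolved is None:
        raise FileNotFoundError(f"tesseract binary not found: {binary}")
    result = subprocess.run(
        build_ocr_command(image_path, language, resolved),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # "sample.png" is a placeholder: point it at any PNG that contains text.
    text = ocr_image("sample.png", binary="/home/local/bin/tesseract")
    print(repr(text))
```

If this returns only whitespace or raises an error for a PNG that clearly contains text, the problem is in the Tesseract installation rather than in FSCrawler.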

Thanks for the help. I finally found the issue.

Could you share what happened as it could be super useful for others? Thanks

Along with Leptonica, we need to install a few image libraries (libpng, libjpeg, libtiff) so that Tesseract can perform OCR on PNG and JPEG files.
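One way to spot this situation early is to look at the version banner printed by `tesseract -v`, which on Leptonica-based builds normally lists the image libraries that were linked in. The sketch below assumes that the banner mentions each library by name. That is typical, but it is not guaranteed for every build. `missing_image_libs` and `tesseract_version_output` are hypothetical helper names.

```python
import subprocess

# The libraries named in this thread as required for OCR on image files.
REQUIRED_LIBS = ("libpng", "libjpeg", "libtiff")


def missing_image_libs(version_output, required=REQUIRED_LIBS):
    """Return the required library names that do not appear in the version text."""
    lowered = version_output.lower()
    return [lib for lib in required if lib not in lowered]


def tesseract_version_output(binary="tesseract"):
    """Return what `tesseract -v` prints, with stdout and stderr combined.

    Some older releases write the banner to stderr, so both streams are merged.
    """
    result = subprocess.run(
        [binary, "-v"],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True,
    )
    return result.stdout


if __name__ == "__main__":
    banner = tesseract_version_output("/home/local/bin/tesseract")
    print("image libraries not reported:", missing_image_libs(banner) or "none")
```

A library missing from the banner suggests that Leptonica was built without it, in which case installing the library and rebuilding Leptonica and Tesseract would be the next step.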

Great to know. Is it something we should document in the project?

Not really. It should be documented in the Tesseract installation instructions.
