Could not see OCR text in "content" field

saikat189 · June 25, 2020, 11:39am

I am using FSCrawler with the settings below:

---
name: "test2"
fs:
  url: "/home/local/es"
  update_rate: "2m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: true
  continue_on_error: false
  ocr:
    enabled: true
    language: "eng"
    path: "/home/local/bin"
    data_path: "/home/local/share/tessdata"
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"

I started it with:

./fscrawler --restart --trace

OCR reports no error, but I still cannot see the OCR text for image, PDF, or TIFF files.

saikat189 · June 25, 2020, 11:44am

13:30:54,074 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] wait for yellow health on index [test2_folder]
13:30:54,077 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] health response: {"cluster_name":"elasticsearch","status":"yellow","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":1,"active_shards":1,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":1,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":61.111111111111114}
13:30:54,080 DEBUG [f.p.e.c.f.FsParserAbstract] creating fs crawler thread [test2] for [/home/local/es] every [2m]
13:30:54,080 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [test2] for [/home/local/es] every [2m]
13:30:54,080 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler thread [test2] is now running. Run #1...
13:30:54,088 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/home/local/es, /home/local/es) = /
13:30:54,091 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing test2_folder/8a4fb678d07a821dbc3abeb4aeb337a?pipeline=null
13:30:54,091 TRACE [f.p.e.c.f.FsParserAbstract] JSon indexed : {
  "root" : "bd7d7b2d64521cab5e6db3ff54a0711c",
  "virtual" : "/",
  "real" : "/home/local/es"
}
13:30:54,094 DEBUG [f.p.e.c.f.FsParserAbstract] indexing [/home/local/es] content
13:30:54,094 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from /home/local/es
13:30:54,099 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 3 local files found
13:30:54,100 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstractModel{name='test-ocr.pdf', file=true, directory=false, lastModifiedDate=2020-06-25T12:13:16.999936, creationDate=2020-06-25T12:13:16.999936, accessDate=2020-06-25T12:13:16.999936, path='/home/local/es', owner='', group='', permissions=777, extension='pdf', fullpath='/home/local/es/test-ocr.pdf', size=112983}
13:30:54,100 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/home/local/es, /home/local/es/test-ocr.pdf) = /test-ocr.pdf
13:30:54,100 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/test-ocr.pdf], includes = [null], excludes = [[*/~*]]
13:30:54,100 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/test-ocr.pdf], excludes = [[*/~*]]
13:30:54,100 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
13:30:54,101 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
13:30:54,101 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/test-ocr.pdf], includes = [null]
13:30:54,101 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
13:30:54,101 DEBUG [f.p.e.c.f.FsParserAbstract] [/test-ocr.pdf] can be indexed: [true]
13:30:54,101 DEBUG [f.p.e.c.f.FsParserAbstract]   - file: /test-ocr.pdf
13:30:54,101 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [/home/local/es],[test-ocr.pdf]
13:30:54,104 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/home/local/es, /home/local/es/test-ocr.pdf) = /test-ocr.pdf
13:30:54,105 TRACE [f.p.e.c.f.t.TikaDocParser] Generating document [/home/local/es/test-ocr.pdf]
13:30:54,113 TRACE [f.p.e.c.f.t.TikaDocParser] Beginning Tika extraction
13:30:54,128 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.
13:30:54,164 DEBUG [f.p.e.c.f.t.TikaInstance] OCR strategy for PDF documents is [ocr_and_text] and tesseract was found.
13:30:54,472 WARN  [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

13:30:54,755 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated so we need to configure Tesseract in case we have specific settings.
13:30:54,756 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Path set to [/home/local/bin].
13:30:54,756 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Data Path set to [/home/local/share/tessdata].
13:30:54,756 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [eng].
13:30:56,277 TRACE [f.p.e.c.f.t.TikaDocParser] End of Tika extraction
13:30:56,285 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
13:30:56,285 TRACE [f.p.e.c.f.f.FsCrawlerUtil] No pattern always matches.
13:30:56,298 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing test2/2e9faf9314bd6ea61ede8b7559f253b?pipeline=null
13:30:56,298 TRACE [f.p.e.c.f.FsParserAbstract] JSon indexed : {
  "content" : "\n \n\n \n\nThis file also contains text. \n\n  \n\n\n\nThis second part of the text is in Page 2 \n\n \n\n\n",
  "meta" : {
    "author" : "David Pilato",
    "title" : "Test Tika title",
    "date" : "2019-03-02T11:42:36.000+00:00",
    "keywords" : [ "keyword1", " keyword2" ],
    "language" : "en-US",
    "format" : "application/pdf; version=1.7",
    "creator_tool" : "Microsoft Word",
    "description" : "Test Tika Object",
    "created" : "2019-03-02T11:42:36.000+00:00"
  },
  "file" : {
    "extension" : "pdf",
    "content_type" : "application/pdf",
    "created" : "2020-06-25T10:13:16.999+00:00",
    "last_modified" : "2020-06-25T10:13:16.999+00:00",
    "last_accessed" : "2020-06-25T10:13:16.999+00:00",
    "indexing_date" : "2020-06-25T11:30:54.103+00:00",
    "filesize" : 112983,
    "filename" : "test-ocr.pdf",
    "url" : "file:///home/local/es/test-ocr.pdf"
  },
  "path" : {
    "root" : "8a4fb678d07a821dbc3abeb4aeb337a",
    "virtual" : "/test-ocr.pdf",
    "real" : "/home/local/es/test-ocr.pdf"
  }
}
13:30:56,298 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstractModel{name='index.png', file=true, directory=false, lastModifiedDate=2020-06-25T12:31:29.664488, creationDate=2020-06-25T12:31:29.664488, accessDate=2020-06-25T12:31:29.664488, path='/home/local/es', owner='',

dadoonet · June 25, 2020, 11:48am

Please format your code, logs or configuration files using </> icon as explained in this guide and not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

This is the icon to use if you are not using markdown format:

There's a live preview panel for exactly this reason.

Lots of people read these forums, and many of them will simply skip over a post that is difficult to read, because it's just too large an investment of their time to try and follow a wall of badly formatted text.
If your goal is to get an answer to your questions, it's in your interest to make it as easy to read and understand as possible.
Please update your post.

saikat189 · June 25, 2020, 3:41pm

Thanks for your reply. I have formatted the post. Could you please check and reply?

dadoonet · June 26, 2020, 3:59pm

Indeed. Some text is missing.

Are you able to run /home/local/bin/tesseract?

saikat189 · June 26, 2020, 6:26pm

Yes, Tesseract is running, but it is not recognizing the text. The value comes back as "\n".
My tesseract executable is in /home/local/bin.

dadoonet · June 26, 2020, 11:48pm

Could you share the output of:

ls -l /home/local/bin/tesseract
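
The same check can be scripted. The following is a minimal Python sketch, not part of FSCrawler or Tesseract; the helper name `describe_executable` is hypothetical. It reports whether a path exists, is a regular file, and is executable by the current user, plus the permission string that `ls -l` would show.

```python
import os
import stat


def describe_executable(path):
    """Return a small dict describing whether `path` looks runnable."""
    info = {"exists": os.path.exists(path), "is_file": False,
            "executable": False, "mode": None}
    if info["exists"]:
        st = os.stat(path)
        info["is_file"] = stat.S_ISREG(st.st_mode)
        # Only a regular file with the execute bit counts as a usable binary.
        info["executable"] = info["is_file"] and os.access(path, os.X_OK)
        # Same permission string that `ls -l` prints, e.g. '-rwxr-xr-x'.
        info["mode"] = stat.filemode(st.st_mode)
    return info
```

For example, `describe_executable("/home/local/bin/tesseract")` should report `executable: True` if the crawler's user is able to launch the binary.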

saikat189 · June 27, 2020, 3:40am

Screenshot from 2020-06-27 05-39-12

dadoonet · July 2, 2020, 9:55am

That's weird.

What is the version of tesseract?

saikat189 · July 2, 2020, 4:16pm

I tried both 4.1.1 and 3.02.02, but neither of them works.

dadoonet · July 3, 2020, 10:01am

That's weird. I honestly don't know. I would need to try on a proper installation. Maybe it's related to the J2KImageReader warning message. I have an open issue about that one.

saikat189 · July 3, 2020, 10:36am

I am sending you a screenshot of the tessdata directory. Please check whether it is adequate.

saikat189 · July 3, 2020, 10:37am

dadoonet · July 3, 2020, 10:56am

I don't know. Best thing to do would be to make an image like a PNG and send it on the command line to tesseract to check that it runs correctly.
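
As a rough illustration of that test, here is a Python sketch that runs Tesseract on a single image outside FSCrawler. It assumes a recent Tesseract that accepts `stdout` as the output base, which may not hold for older releases such as 3.02. The helper names `build_tesseract_command` and `run_ocr` are hypothetical.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng", binary="tesseract"):
    """Build the argv list for: tesseract <image> stdout -l <lang>."""
    return [binary, image_path, "stdout", "-l", lang]


def run_ocr(image_path, lang="eng", binary="tesseract"):
    """Run tesseract on one image and return (exit_code, text, errors)."""
    # shutil.which also accepts an absolute path such as /home/local/bin/tesseract.
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found or not executable")
    result = subprocess.run(
        build_tesseract_command(image_path, lang, binary),
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,
    )
    return result.returncode, result.stdout, result.stderr
```

Calling `run_ocr("index.png", binary="/home/local/bin/tesseract")` and getting no text back, or a message on the error stream, would suggest the problem is in the Tesseract installation rather than in FSCrawler.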

saikat189 · July 3, 2020, 11:58am

Thanks for the help. I finally found the issue.

dadoonet · July 3, 2020, 2:13pm

Could you share what happened as it could be super useful for others? Thanks

saikat189 · July 6, 2020, 4:18am

Along with Leptonica, we need to install a couple of libraries (libpng, libjpeg, libtiff) to perform OCR on PNG/JPEG images.
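
One way to spot this kind of gap is to look at what `tesseract --version` prints. On Tesseract 4.x the output usually lists the image libraries Leptonica was built with, though older versions may print less, or print to the error stream. The sketch below is an assumption-laden helper, not an official check: `missing_image_libs` is a hypothetical name and the sample text is illustrative only, not taken from this thread.

```python
REQUIRED_IMAGE_LIBS = ("libpng", "libjpeg", "libtiff")

# Illustrative shape of `tesseract --version` output (assumed, not from the thread).
SAMPLE_VERSION_OUTPUT = """tesseract 4.1.1
 leptonica-1.79.0
  libjpeg 8d : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11
"""


def missing_image_libs(version_output, required=REQUIRED_IMAGE_LIBS):
    """Return the required image libraries not mentioned in the version text."""
    text = version_output.lower()
    return [lib for lib in required if lib not in text]
```

An empty list means all three libraries are mentioned. Any name returned suggests Leptonica was built without that library, so Tesseract may silently fail to read the matching image format.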

dadoonet · July 6, 2020, 5:46am

Great to know. Is it something we should document in the project?

saikat189 · July 6, 2020, 7:40am

Not really. It should be documented in the Tesseract installation instructions.

system · August 3, 2020, 7:40am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
