Perform OCR for more than one language by fscrawler


(mohsen zanjani) #1

Hi,
I want FSCrawler to run OCR in two or more languages at the same time.
As you know, the OCR language for a job is set in its config file, as shown below.

config file: ~/.fscrawler/job_name/_settings.json

{
  "name" : "job_name",
  "fs" : {
    "url" : "/home/monitoring_files/",
    "update_rate" : "30s",
    "excludes" : [ "~*" ],
    "json_support" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : true,
    "add_as_inner_object" : false,
    "store_source" : false,
    "index_content" : true,
    "attributes_support" : false,
    "raw_metadata" : true,
    "xml_support" : false,
    "index_folders" : true,
    "lang_detect" : false,
    "continue_on_error" : false,
    "indexed_chars": "100%",
    "pdf_ocr" : true,
    "ocr" : {
      "language" : "eng"
    }
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "127.0.0.1",
      "port" : 9200,
      "scheme" : "HTTP"
    } ],
    "bulk_size" : 100,
    "flush_interval" : "5s"
  },
  "rest" : {
    "scheme" : "HTTP",
    "host" : "127.0.0.1",
    "port" : 8081,
    "endpoint" : "fscrawler"
  }
}

With a single language (and its Tesseract language pack installed), OCR works correctly.
But how can OCR be run for multiple languages at once?
Something like this:

"fs" : {
    "ocr" : {
      "language": "eng+fra"
    }
  }

OR this:

"fs" : {
    "ocr" : {
      "language": "eng"
    },
    "ocr" : {
      "language": "fra"
    }
  }

Thanks ...
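
A side note on the second form above: it repeats the "ocr" key inside one JSON object, so most parsers will keep only one of the two values or reject the document. The sketch below shows this with Python's standard json module. It is only an illustration; how FSCrawler's own parser treats duplicate keys is not established in this thread, and `reject_duplicates` is a hypothetical helper name.

```python
import json

# The second form from the question, wrapped in braces so it parses.
second_form = '''
{
  "fs": {
    "ocr": {"language": "eng"},
    "ocr": {"language": "fra"}
  }
}
'''

def reject_duplicates(pairs):
    """object_pairs_hook that raises if a key appears twice in one object."""
    keys = [key for key, _ in pairs]
    dupes = sorted({key for key in keys if keys.count(key) > 1})
    if dupes:
        raise ValueError(f"duplicate keys: {dupes}")
    return dict(pairs)

# Default behaviour: the last duplicate silently wins, so "eng" is lost.
parsed = json.loads(second_form)
print(parsed["fs"]["ocr"]["language"])  # prints: fra

# Strict behaviour: surface the problem instead of dropping a value.
try:
    json.loads(second_form, object_pairs_hook=reject_duplicates)
except ValueError as err:
    print(err)
```

Either way, a single "language" value is the only form that can carry more than one language through a parser like this.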


(David Pilato) #2

It's not supported. Could you open an issue in the FSCrawler project?
I'm not sure whether it's doable or how much complexity it would involve.


(mohsen zanjani) #3

Thanks for your reply.
I opened an issue in the FSCrawler project:
Perform OCR for more than one language by fscrawler


(mohsen zanjani) #4

After some investigation, I found that Tika allows multiple languages to be specified for OCR.
To do this, simply join the desired language codes with the '+' sign.
For example: "eng+fas+fra"

The same value can be used in FSCrawler for the language attribute.

"fs" : {
    "ocr" : {
      "language": "eng+fas+fra"
    }
  }

It's working for me :)
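
If the language list is generated rather than typed by hand, the '+' joining can be scripted. This is a minimal sketch, assuming the settings are handled as plain JSON data; `ocr_language_value` and `with_ocr_languages` are hypothetical helper names, not part of FSCrawler or Tika.

```python
import json

def ocr_language_value(languages):
    """Join Tesseract language codes with '+', dropping blanks and duplicates."""
    seen = []
    for code in languages:
        code = code.strip()
        if code and code not in seen:
            seen.append(code)
    if not seen:
        raise ValueError("at least one language code is required")
    return "+".join(seen)

def with_ocr_languages(settings, languages):
    """Return a copy of the job settings with fs.ocr.language set."""
    updated = json.loads(json.dumps(settings))  # deep copy of plain JSON data
    ocr = updated.setdefault("fs", {}).setdefault("ocr", {})
    ocr["language"] = ocr_language_value(languages)
    return updated

# A trimmed-down version of the job settings from the first post.
settings = {"name": "job_name", "fs": {"pdf_ocr": True, "ocr": {"language": "eng"}}}
updated = with_ocr_languages(settings, ["eng", "fas", "fra"])
print(json.dumps(updated["fs"]["ocr"], indent=2))
```

The original settings object is left untouched, so the result can be compared with it before being written back to _settings.json.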

But another question:
The OCR output for the language pack I need was poor, so I installed Tesseract version 4. Unfortunately, FSCrawler does not recognize it (the tests mentioned above were run with an older version of Tesseract).

I added the settings below, but OCR still does not run, even when only one language is specified.
(I'm not sure exactly what values the path and data_path attributes expect.)

"fs" : {
    "ocr" : {
      "language": "eng+fas+fra",
      "path" : "/usr/bin/tesseract",
      "data_path" : "/usr/share/tesseract/4/tessdata/"
    }
  }

What's wrong?


(David Pilato) #5

That's great. Could you please send a PR to FSCrawler to document that? I think this can be super useful.

About your question: FSCrawler does not use Tesseract directly. It calls Tika, which in turn calls Tesseract. I haven't checked whether upcoming versions of Tika will support Tesseract 4.
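
Because Tesseract sits at the end of that chain, one way to narrow the problem down is to check the Tesseract side on its own, outside FSCrawler and Tika. The sketch below is only a diagnostic under stated assumptions: it relies on Tesseract's convention of one `<code>.traineddata` file per language in the tessdata directory, and because the thread does not establish whether `path` should name the binary or its directory, it accepts either. The function names are hypothetical, and the values are the ones from the question.

```python
import os
import shutil
import subprocess

def resolve_tesseract(path):
    """Accept either the binary itself or the directory that contains it."""
    candidate = os.path.join(path, "tesseract") if os.path.isdir(path) else path
    return shutil.which(candidate)  # None if missing or not executable

def tesseract_version(exe):
    """Return the first line printed by `tesseract --version`, or None."""
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    # Older releases print the version on stderr, newer ones on stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None

def missing_traineddata(data_path, language):
    """Return the language codes that have no <code>.traineddata file."""
    codes = [code for code in language.split("+") if code]
    return [code for code in codes
            if not os.path.isfile(os.path.join(data_path, code + ".traineddata"))]

if __name__ == "__main__":
    # Values copied from the settings in the question above.
    ocr = {
        "language": "eng+fas+fra",
        "path": "/usr/bin/tesseract",
        "data_path": "/usr/share/tesseract/4/tessdata/",
    }
    exe = resolve_tesseract(ocr["path"])
    print("tesseract binary:", exe or "not found")
    if exe:
        print("version:", tesseract_version(exe))
    print("missing language files:",
          missing_traineddata(ocr["data_path"], ocr["language"]) or "none")
```

If the binary and all language files are found here, the remaining suspects are the Tika version bundled with FSCrawler and the way these two settings are passed to it.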


(mohsen zanjani) #6

Thanks for your reply.
Sorry, I don't have time to send a PR right now. If I find the time, I'll do it.

