Perform OCR for more than one language by fscrawler

mohsen · November 11, 2018, 7:04am

Hi,
I want to run OCR in two or more languages at the same time.
As you know, the OCR language for a job can be set in its config file (as below).

config file: ~/.fscrawler/job_name/_settings.json

{
  "name" : "job_name",
  "fs" : {
    "url" : "/home/monitoring_files/",
    "update_rate" : "30s",
    "excludes" : [ "~*" ],
    "json_support" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : true,
    "add_as_inner_object" : false,
    "store_source" : false,
    "index_content" : true,
    "attributes_support" : false,
    "raw_metadata" : true,
    "xml_support" : false,
    "index_folders" : true,
    "lang_detect" : false,
    "continue_on_error" : false,
    "indexed_chars": "100%",
    "pdf_ocr" : true,
    "ocr" : {
      "language" : "eng"
    }
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "127.0.0.1",
      "port" : 9200,
      "scheme" : "HTTP"
    } ],
    "bulk_size" : 100,
    "flush_interval" : "5s"
  },
  "rest" : {
    "scheme" : "HTTP",
    "host" : "127.0.0.1",
    "port" : 8081,
    "endpoint" : "fscrawler"
  }
}

For a single language (with its Tesseract language pack installed), OCR works correctly.
But how can we run OCR for multiple languages at the same time?
Something like this:

"fs" : {
    "ocr" : {
      "language": "eng+fra"
    }
  }

OR this:

"fs" : {
    "ocr" : {
      "language": "eng"
    },
    "ocr" : {
      "language": "fra"
    }
  }

Thanks ...

dadoonet · November 11, 2018, 7:21am

It's not supported. Could you open an issue in the FSCrawler project?
I'm not sure whether it's doable or how much complexity it would involve.

mohsen · November 11, 2018, 8:11am

Thanks for your reply.
I opened an issue in the FSCrawler project:
Perform OCR for more than one language by fscrawler

mohsen · November 12, 2018, 8:09am

After some investigation, I found that Tika allows multiple languages to be specified for OCR.
To do this, simply concatenate the desired language codes with the '+' sign.
For example: "eng+fas+fra"

The same value can be set for the language attribute in FSCrawler:

"fs" : {
    "ocr" : {
      "language": "eng+fas+fra"
    }
  }

It's working for me.
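
For anyone trying the same thing, here is a minimal Python sketch for building the '+'-joined value and comparing it with the language packs Tesseract reports. The helper names (build_ocr_language, parse_list_langs, installed_languages) are my own and are not part of FSCrawler or Tika. It assumes the `tesseract --list-langs` command prints a header line followed by one language code per line.

```python
import subprocess


def build_ocr_language(codes):
    """Join Tesseract language codes with '+', dropping blanks and duplicates."""
    unique = []
    for code in codes:
        code = code.strip()
        if code and code not in unique:
            unique.append(code)
    if not unique:
        raise ValueError("at least one language code is required")
    return "+".join(unique)


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`.

    Assumption: a header line starting with "List of available languages",
    then one code per line.
    """
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        if not line or line.lower().startswith("list of available languages"):
            continue
        langs.add(line)
    return langs


def installed_languages(tesseract="tesseract"):
    """Ask the local Tesseract binary which language packs it can see."""
    result = subprocess.run(
        [tesseract, "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some Tesseract versions print the list on stderr, so read both streams.
    # Any warnings printed there would be picked up as codes too.
    return parse_list_langs(result.stdout + "\n" + result.stderr)
```

Calling build_ocr_language(["eng", "fas", "fra"]) gives the value used in the config above, and any code that is not in installed_languages() points to a language pack that still needs to be installed.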

But another question:
Since the OCR output was poor for the language pack I needed, I had to install Tesseract version 4. Unfortunately, FSCrawler did not recognize it (the tests mentioned above were performed with an older version of Tesseract).

I added the following settings, but they did not work: OCR is still not performed, even when only one language is specified.
(I'm not sure exactly what values to give for the path and data_path attributes.)

"fs" : {
    "ocr" : {
      "language": "eng+fas+fra",
      "path" : "/usr/bin/tesseract",
      "data_path" : "/usr/share/tesseract/4/tessdata/"
    }
  }

What's wrong?
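
One thing that can be checked independently of FSCrawler is whether the configured locations look the way Tesseract expects. The Python sketch below is only a diagnostic aid with hypothetical helper names (missing_traineddata, describe_path). It relies on Tesseract's convention that the data directory holds one `<code>.traineddata` file per language. It also reports whether a path is a file or a directory, since it may be worth verifying whether the path setting expects the directory containing the executable rather than the executable itself. That last point is an assumption to test, not something confirmed in this thread.

```python
from pathlib import Path


def missing_traineddata(data_path, language):
    """Return the codes in `language` that have no <code>.traineddata in data_path."""
    data_dir = Path(data_path)
    missing = []
    for code in language.split("+"):
        code = code.strip()
        if code and not (data_dir / f"{code}.traineddata").is_file():
            missing.append(code)
    return missing


def describe_path(path):
    """Classify a configured path as 'directory', 'file', or 'missing'."""
    candidate = Path(path)
    if candidate.is_dir():
        return "directory"
    if candidate.is_file():
        return "file"
    return "missing"
```

With the settings above, missing_traineddata("/usr/share/tesseract/4/tessdata/", "eng+fas+fra") lists any language whose data file is absent from that directory, and describe_path("/usr/bin/tesseract") shows whether the path value points at a file or a directory.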

dadoonet · November 12, 2018, 8:40am

That's great. Could you please send a PR to FSCrawler to document that? I think this can be super useful.

About your question: FSCrawler does not use Tesseract directly. It calls Tika, which in turn calls Tesseract. I haven't checked whether upcoming versions of Tika support it.

mohsen · November 12, 2018, 9:24am

Thanks for your reply.
Sorry, I don't have time to send a PR right now. If I find enough time, I'll do it.

system · December 10, 2018, 9:24am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
