Tesseract-OCR only returns new lines

I am using Tesseract-OCR to extract text from pdfs with fscrawler. I have tried configuring _settings.yaml with the path and data_path for Tesseract and also tried it without the path and data_path. However, when I use Kibana to look at the content of the OCR-ed pdfs, I just get new lines or empty space.

I also used jpgs and pngs to test if Tesseract can OCR those files, but fscrawler is not even reading jpgs or pngs. No content is extracted from jpgs or pngs.

On Python, I used pytesseract to check if Tesseract works on a jpg file with the same image as the pdfs I fed to fscrawler and pytesseract was able to pick up the text.

My questions are:

  1. Is Tesseract-OCR just not working with fscrawler and how can I tell? (The command prompt logs seem fine. There is no error saying that OCR is not going to be performed.)

  2. Do I need to convert the pdfs to jpgs first for Tesseract to work and if that is the case, what should I do about the issue that fscrawler is not reading jpgs?

Thanks for your help!



It's probably a misconfiguration or something like this.

Is Tesseract-OCR just not working with fscrawler and how can I tell? (The command prompt logs seem fine. There is no error saying that OCR is not going to be performed.)

Could you share your FSCrawler job settings, run FSCrawler with --trace on a directory which contains only one file (pdf or png), and share the full logs?

Do I need to convert the pdfs to jpgs first for Tesseract to work and if that is the case, what should I do about the issue that fscrawler is not reading jpgs?

No. This should work. I have some integration tests which show that this is working.
If you still can't make it work, could you share an example of a PDF document that can be reused in tests?

Hi David,

Thanks for your response. Below is the FSCrawler job settings:

    fs:
      url: "C:\\Data_Privacy_GAT\\Testing_Tesseract_2"
      update_rate: "15m"
      excludes:
      - "*/~*"
      json_support: false
      filename_as_id: true
      add_filesize: true
      remove_deleted: true
      add_as_inner_object: false
      store_source: false
      index_content: true
      attributes_support: false
      raw_metadata: true
      xml_support: false
      index_folders: true
      lang_detect: true
      continue_on_error: true
      ocr:
        language: "eng"
        enabled: true
        pdf_strategy: "ocr_and_text"
        follow_symlinks: false

I will paste the trace in another post because I would otherwise exceed the character limit.

Here is the trace. I will paste in two comments.

    16:20:37,756 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] create index [testing_t
    16:20:37,756 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] index settings: [{
      "settings": {
        "analysis": {
          "analyzer": {
            "fscrawler_path": {
              "tokenizer": "fscrawler_path"
          "tokenizer": {
            "fscrawler_path": {
              "type": "path_hierarchy"
      "mappings": {
        "properties" : {
          "real" : {
            "type" : "keyword",
            "store" : true
          "root" : {
            "type" : "keyword",
            "store" : true
          "virtual" : {
            "type" : "keyword",
            "store" : true
    16:20:38,053 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] wait for yellow health
    on index [testing_tesseract_v5_folder]
    16:20:38,053 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] health response: {"clus
    16:20:38,053 DEBUG [f.p.e.c.f.FsParserAbstract] creating fs crawler thread [test
    ing_tesseract_v5] for [C:\Data_Privacy_GAT\Testing_Tesseract_2] every [15m]
    16:20:38,053 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [testing_
    tesseract_v5] for [C:\Data_Privacy_GAT\Testing_Tesseract_2] every [15m]
    16:20:38,069 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler thread [testing_tesse
    ract_v5] is now running. Run #1...
    16:20:38,069 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\Data_Pr
    ivacy_GAT\Testing_Tesseract_2, C:\Data_Privacy_GAT\Testing_Tesseract_2) = /
    16:20:38,085 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing testing_tesseract_v5_fo
    16:20:38,085 TRACE [f.p.e.c.f.FsParserAbstract] JSon indexed : {
      "root" : "95bb7225f7e68f3e099d686fe0a73",
      "virtual" : "/",
      "real" : "C:\\Data_Privacy_GAT\\Testing_Tesseract_2"
    16:20:38,085 DEBUG [f.p.e.c.f.FsParserAbstract] indexing [C:\Data_Privacy_GAT\Te
    sting_Tesseract_2] content
    16:20:38,085 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from C
    16:20:38,100 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped fo
    r file [C:\Data_Privacy_GAT\Testing_Tesseract_2\noisy.pdf] on [windows server 2
    012 r2]
    16:20:38,100 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped fo
    r file [C:\Data_Privacy_GAT\Testing_Tesseract_2\noisy.pdf] on [windows server 2
    012 r2]
    16:20:38,100 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 1 local files found
    16:20:38,100 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstract
    Model{name='noisy.pdf', file=true, directory=false, lastModifiedDate=2020-05-22T
    16:27:57.047518, creationDate=2020-05-28T16:12:41.781162, accessDate=2020-05-28T
    16:12:41.781162, path='C:\Data_Privacy_GAT\Testing_Tesseract_2', owner='IT-DAS\O
    uyang', group='null', permissions=-1, extension='pdf', fullpath='C:\Data_Privacy
    _GAT\Testing_Tesseract_2\noisy.pdf', size=43987}
    16:20:38,100 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\Data_Pr
    ivacy_GAT\Testing_Tesseract_2, C:\Data_Privacy_GAT\Testing_Tesseract_2\noisy.pdf
    ) = /noisy.pdf
    16:20:38,100 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [
    /noisy.pdf], includes = [null], excludes = [[*/~*]]
    16:20:38,100 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/noisy.pdf], excludes
     = [[*/~*]]
    16:20:38,100 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
    16:20:38,100 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude patter
    16:20:38,100 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/noisy.pdf], includes
     = [null]
    16:20:38,100 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
    16:20:38,100 DEBUG [f.p.e.c.f.FsParserAbstract] [/noisy.pdf] can be indexed: [tr
    16:20:38,100 DEBUG [f.p.e.c.f.FsParserAbstract]   - file: /noisy.pdf
    16:20:38,100 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [C:\Data_P
    16:20:38,100 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\Data_Pr
    ivacy_GAT\Testing_Tesseract_2, C:\Data_Privacy_GAT\Testing_Tesseract_2\noisy.pdf
    ) = /noisy.pdf
    16:20:38,116 TRACE [f.p.e.c.f.t.TikaDocParser] Generating document [C:\Data_Priv
    16:20:38,116 TRACE [f.p.e.c.f.t.TikaDocParser] Beginning Tika extraction
    16:20:38,147 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.
    16:20:38,178 DEBUG [f.p.e.c.f.t.TikaInstance] OCR strategy for PDF documents is
    [ocr_and_text] and tesseract was found.
    16:20:38,506 WARN  [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files
     will not be processed.
    See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
    for optional dependencies.
    16:20:38,881 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated so we need to con
    figure Tesseract in case we have specific settings.
    16:20:38,881 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [eng].
    16:20:43,279 TRACE [f.p.e.c.f.t.TikaDocParser] End of Tika extraction
    16:20:44,265 TRACE [f.p.e.c.f.t.TikaDocParser] Main detected language: [en: HIGH
    16:20:47,553 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] Sending a bulk request
    of [1] requests
    16:20:48,875 TRACE [f.p.e.c.f.t.TikaDocParser] Listing all available metadata:
    16:20:48,876 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw.entrySet(), iter
    16:20:48,877 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("date"
    , "2020-05-22T14:27:56Z"));
    16:20:48,877 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:u
    nmappedUnicodeCharsPerPage", "0"));
    16:20:48,878 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:P
    DFVersion", "1.7"));
    16:20:48,878 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:d
    ocinfo:title", "tesseract_header.jpg"));
    16:20:48,879 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:h
    asXFA", "false"));
    16:20:48,880 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("acces
    s_permission:modify_annotations", "true"));
    16:20:48,880 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("acces
    s_permission:can_print_degraded", "true"));
    16:20:48,881 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("dc:cr
    eator", "z0045ucs"));
    16:20:48,882 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("dcter
    ms:created", "2020-05-22T14:27:56Z"));
    16:20:48,882 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("Last-
    Modified", "2020-05-22T14:27:56Z"));
    16:20:48,883 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("dcter
    ms:modified", "2020-05-22T14:27:56Z"));
    16:20:48,884 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("dc:fo
    rmat", "application/pdf; version=1.7"));
    16:20:48,884 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("title
    ", "tesseract_header.jpg"));
    16:20:48,885 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("Last-
    Save-Date", "2020-05-22T14:27:56Z"));
    16:20:48,885 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("acces
    s_permission:fill_in_form", "true"));
    16:20:48,886 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:d
    ocinfo:modified", "2020-05-22T14:27:56Z"));
    16:20:48,887 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("meta:
    save-date", "2020-05-22T14:27:56Z"));
    16:20:48,887 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:e
    ncrypted", "false"));
    16:20:48,888 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("dc:ti
    tle", "tesseract_header.jpg"));
    16:20:48,888 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("modif
    ied", "2020-05-22T14:27:56Z"));
    16:20:48,889 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:h
    asMarkedContent", "false"));
    16:20:48,890 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("Conte
    nt-Type", "application/pdf"));
    16:20:48,890 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:d
    ocinfo:creator", "z0045ucs"));
    16:20:48,891 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("X-Par
    sed-By", "org.apache.tika.parser.pdf.PDFParser"));
    16:20:48,892 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("creat
    or", "z0045ucs"));
    16:20:48,893 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("meta:
    author", "z0045ucs"));
    16:20:48,894 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("meta:
    creation-date", "2020-05-22T14:27:56Z"));
    16:20:48,895 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("creat
    ed", "2020-05-22T14:27:56Z"));
    16:20:48,895 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("acces
    s_permission:extract_for_accessibility", "true"));
    16:20:48,896 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("acces
    s_permission:assemble_document", "true"));
    16:20:48,897 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("xmpTP
    g:NPages", "1"));
    16:20:48,898 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("Creat
    ion-Date", "2020-05-22T14:27:56Z"));
    16:20:48,899 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("resou
    rceName", "noisy.pdf"));
    16:20:48,900 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:h
    asXMP", "false"));
    16:20:48,900 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:c
    harsPerPage", "0"));
    16:20:48,901 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("acces
    s_permission:extract_content", "true"));
    16:20:48,902 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("acces
    s_permission:can_print", "true"));
    16:20:48,903 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("Autho
    r", "z0045ucs"));
    16:20:48,903 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("produ
    cer", "Microsoft: Print To PDF"));
    16:20:48,904 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("acces
    s_permission:can_modify", "true"));
    16:20:48,905 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:d
    ocinfo:producer", "Microsoft: Print To PDF"));
    16:20:48,906 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:d
    ocinfo:created", "2020-05-22T14:27:56Z"));
    16:20:48,907 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
    16:20:48,908 TRACE [f.p.e.c.f.f.FsCrawlerUtil] No pattern always matches.
    16:20:48,913 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] Executed bulk request w
    ith [1] requests
    16:20:48,928 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing testing_tesseract_v5/no
    16:20:48,928 TRACE [f.p.e.c.f.FsParserAbstract] JSon indexed : {
      "content" : "\n  \n\nCcv)malcolm:tesseract-python adrianrosebrock$ python ocr.
    py --image images/example_@1.png\nNoisy image\nto test\nOCS 1 1yeln edd\n\n
     \n   \n  \n\nNoisy image\nto test\nTesseract OCR\n\n  \n  \n\n\n",
      "meta" : {
        "author" : "z0045ucs",
        "title" : "tesseract_header.jpg",
        "date" : "2020-05-22T12:27:56.000+00:00",
        "language" : "en",
        "format" : "application/pdf; version=1.7",
        "created" : "2020-05-22T12:27:56.000+00:00",
        "raw" : {
          "date" : "2020-05-22T14:27:56Z",
          "pdf:unmappedUnicodeCharsPerPage" : "0",
          "pdf:PDFVersion" : "1.7",
          "pdf:docinfo:title" : "tesseract_header.jpg",
          "pdf:hasXFA" : "false",
          "access_permission:modify_annotations" : "true",
          "access_permission:can_print_degraded" : "true",
          "dc:creator" : "z0045ucs",
          "dcterms:created" : "2020-05-22T14:27:56Z",
          "Last-Modified" : "2020-05-22T14:27:56Z",
          "dcterms:modified" : "2020-05-22T14:27:56Z",
          "dc:format" : "application/pdf; version=1.7",
          "title" : "tesseract_header.jpg",
          "Last-Save-Date" : "2020-05-22T14:27:56Z",
          "access_permission:fill_in_form" : "true",
          "pdf:docinfo:modified" : "2020-05-22T14:27:56Z",
          "meta:save-date" : "2020-05-22T14:27:56Z",
          "pdf:encrypted" : "false",
          "dc:title" : "tesseract_header.jpg",
          "modified" : "2020-05-22T14:27:56Z",
          "pdf:hasMarkedContent" : "false",
          "Content-Type" : "application/pdf",
          "pdf:docinfo:creator" : "z0045ucs",
          "X-Parsed-By" : "org.apache.tika.parser.pdf.PDFParser",
          "creator" : "z0045ucs",
          "meta:author" : "z0045ucs",
          "meta:creation-date" : "2020-05-22T14:27:56Z",
          "created" : "2020-05-22T14:27:56Z",
          "access_permission:extract_for_accessibility" : "true",
          "access_permission:assemble_document" : "true",
          "xmpTPg:NPages" : "1",
          "Creation-Date" : "2020-05-22T14:27:56Z",
          "resourceName" : "noisy.pdf",
          "pdf:hasXMP" : "false",
          "pdf:charsPerPage" : "0",
          "access_permission:extract_content" : "true",
          "access_permission:can_print" : "true",
          "Author" : "z0045ucs",
          "producer" : "Microsoft: Print To PDF",
          "access_permission:can_modify" : "true",
          "pdf:docinfo:producer" : "Microsoft: Print To PDF",
          "pdf:docinfo:created" : "2020-05-22T14:27:56Z"
      "file" : {
        "extension" : "pdf",
        "content_type" : "application/pdf",
        "created" : "2020-05-28T14:12:41.781+00:00",
        "last_modified" : "2020-05-22T14:27:57.047+00:00",
        "last_accessed" : "2020-05-28T14:12:41.781+00:00",
        "indexing_date" : "2020-05-28T14:20:38.100+00:00",
        "filesize" : 43987,
        "filename" : "noisy.pdf",
        "url" : "file://C:\\Data_Privacy_GAT\\Testing_Tesseract_2\\noisy.pdf"
      "path" : {
        "root" : "501a70282ead4e6535ce27023b95d",
        "virtual" : "/noisy.pdf",
        "real" : "C:\\Data_Privacy_GAT\\Testing_Tesseract_2\\noisy.pdf"
    16:20:48,949 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [C:
    16:20:48,950 TRACE [f.p.e.c.f.FsParserAbstract] Querying elasticsearch for files
     in dir [path.root:501a70282ead4e6535ce27023b95d]
    16:20:48,986 TRACE [f.p.e.c.f.FsParserAbstract] Response [fr.pilato.elasticsearc
    16:20:48,987 TRACE [f.p.e.c.f.FsParserAbstract] We found: []
    16:20:48,987 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed directories
    in [C:\Data_Privacy_GAT\Testing_Tesseract_2]...
    16:20:49,001 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler is going to sleep for
    16:20:58,884 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] Sending a bulk request
    of [1] requests
    16:20:58,942 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] Executed bulk request w
    ith [1] requests

Thanks for your help.

I can see some content extracted:

\n \n\nCcv)malcolm:tesseract-python adrianrosebrock$ python ocr.
py --image images/example_@1.png\nNoisy image\nto test\nOCS 1 1yeln edd\n\n
\n \n \n\nNoisy image\nto test\nTesseract OCR\n\n \n \n\n\n

Isn't that what you are looking for? Could you share the noisy.pdf document?
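
To check whether an indexed `content` value holds real text rather than only line breaks and spaces, a small helper along these lines may help. This is a minimal sketch: the function name `has_visible_text` and the `min_chars` threshold are illustrative assumptions, not part of FSCrawler, and the sample strings are shortened stand-ins modeled on the values in this thread.

```python
def has_visible_text(content, min_chars=1):
    """Return True if `content` holds at least `min_chars` non-whitespace characters."""
    if not content:
        return False
    visible = sum(1 for ch in content if not ch.isspace())
    return visible >= min_chars


# One value that is whitespace only, and one that resembles the OCR output above.
empty_like = "\n  \n\n \n   \n"
extracted = "\n  \n\nNoisy image\nto test\nTesseract OCR\n\n  \n"

print(has_visible_text(empty_like))
print(has_visible_text(extracted))
```

Raising `min_chars` can help separate a few stray OCR characters from a page that was actually read.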

Hi David,

Yeah, you are right. The content was extracted! I don't understand why it didn't extract the text from my other directory though.

I created a new directory like you asked with 1 file and it worked.

But when I added a new file to the directory, for some reason it is not showing up or getting processed. I still have just 1 file in Kibana right now, and it's been about 30 minutes since I put the new pdf into the directory.

Do you know what the problem is with that?



If you are moving a file from one dir to another, it's likely that its modification date is older than the last run date of FSCrawler. Thus the file is not considered new.

You can run fscrawler with the --restart option. It will simply ignore all file dates and will reindex everything.
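
The date check described above can be pictured roughly as follows. This is a simplified sketch for illustration only, not FSCrawler's actual implementation: `is_considered_new` and `last_run` are hypothetical names, and the sketch compares nothing but the modification time against the timestamp of the previous run.

```python
import os
import tempfile
import time


def is_considered_new(path, last_run):
    """Return True if the file's modification time is later than `last_run` (epoch seconds)."""
    return os.path.getmtime(path) > last_run


# Demo: a file whose modification date predates the last run is skipped,
# which mirrors what happens to a file moved in from another directory.
with tempfile.TemporaryDirectory() as tmp:
    moved = os.path.join(tmp, "moved.pdf")
    with open(moved, "wb") as fh:
        fh.write(b"%PDF-1.7")

    last_run = time.time()
    old = last_run - 7 * 24 * 3600  # pretend the file was last modified a week ago
    os.utime(moved, (old, old))

    print(is_considered_new(moved, last_run))  # old modification date: not picked up
```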

I see. I did just move a file from another directory. When I copied and pasted the file instead, fscrawler was able to process it.

Thanks for your help, David.

If you're running on Linux you can also "touch" the file to make it appear more recent than it is.
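
The job in this thread runs on Windows, where a `touch` command is not available by default, so the same effect can be had from Python. A minimal sketch using only the standard library; the helper name `touch` and the file name `example.pdf` are placeholders.

```python
import os
import tempfile
import time
from pathlib import Path


def touch(path):
    """Set the file's modification time to now, creating the file if it is missing."""
    Path(path).touch(exist_ok=True)


# Demo on a throwaway file with an artificially old modification date.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "example.pdf"
    target.write_bytes(b"%PDF-1.7")

    week_ago = time.time() - 7 * 24 * 3600
    os.utime(target, (week_ago, week_ago))

    touch(target)
    print(target.stat().st_mtime > week_ago)
```

Calling `touch` on a file that was moved into the crawled directory should make it look modified after the last run, without needing `--restart`.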
