Tesseract-OCR only returns new lines

morningkaren · May 22, 2020, 8:36pm

I am using Tesseract-OCR to extract text from pdfs with fscrawler. I have tried configuring _settings.yaml with the path and data_path for Tesseract and also tried it without the path and data_path. However, when I use Kibana to look at the content of the OCR-ed pdfs, I just get new lines or empty space.

I also used jpgs and pngs to test if Tesseract can OCR those files, but fscrawler is not even reading jpgs or pngs. No content is extracted from jpgs or pngs.

In Python, I used pytesseract to check whether Tesseract works on a jpg file containing the same image as the pdfs I fed to fscrawler, and pytesseract was able to pick up the text.

My questions are:

Is Tesseract-OCR just not working with fscrawler and how can I tell? (The command prompt logs seem fine. There is no error saying that OCR is not going to be performed.)
Do I need to convert the pdfs to jpgs first for Tesseract to work and if that is the case, what should I do about the issue that fscrawler is not reading jpgs?

Thanks for your help!

Karen

dadoonet · May 28, 2020, 1:30pm

Welcome!

It's probably a misconfiguration or something like that.

Is Tesseract-OCR just not working with fscrawler and how can I tell? (The command prompt logs seem fine. There is no error saying that OCR is not going to be performed.)

Could you share your FSCrawler job settings, run FSCrawler with --trace on a directory which contains only one file (pdf or png) and share the full logs?

Do I need to convert the pdfs to jpgs first for Tesseract to work and if that is the case, what should I do about the issue that fscrawler is not reading jpgs?

No. This should work. I do have some integration tests which show that this is working.
If you still can't make it work, could you share an example of a PDF document that can be reused in tests?

morningkaren · May 28, 2020, 2:30pm

Hi David,

Thanks for your response. Below is the FSCrawler job settings:

    fs:
      url: "C:\\Data_Privacy_GAT\\Testing_Tesseract_2"
      update_rate: "15m"
      excludes:
      - "*/~*"
      json_support: false
      filename_as_id: true
      add_filesize: true
      remove_deleted: true
      add_as_inner_object: false
      store_source: false
      index_content: true
      attributes_support: false
      raw_metadata: true
      xml_support: false
      index_folders: true
      lang_detect: true
      continue_on_error: true
      ocr:
        language: "eng"
        enabled: true
        pdf_strategy: "ocr_and_text"
        follow_symlinks: false

I will paste the trace in another post because it would exceed the character limit.

morningkaren · May 28, 2020, 2:32pm

Here is the trace. I will paste it in two comments.

    ng_shards":0,"initializing_shards":0,"unassigned_shards":4,"delayed_unassigned_s
    hards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_wai
    ting_in_queue_millis":0,"active_shards_percent_as_number":44.44444444444444}
    16:20:37,756 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] create index [testing_t
    esseract_v5_folder]
    16:20:37,756 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] index settings: [{
      "settings": {
        "analysis": {
          "analyzer": {
            "fscrawler_path": {
              "tokenizer": "fscrawler_path"
            }
          },
          "tokenizer": {
            "fscrawler_path": {
              "type": "path_hierarchy"
            }
          }
        }
      },
      "mappings": {
        "properties" : {
          "real" : {
            "type" : "keyword",
            "store" : true
          },
          "root" : {
            "type" : "keyword",
            "store" : true
          },
          "virtual" : {
            "type" : "keyword",
            "store" : true
          }
        }
      }
    }
    ]
    16:20:38,053 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] wait for yellow health
    on index [testing_tesseract_v5_folder]
    16:20:38,053 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] health response: {"clus
    ter_name":"elasticsearch","status":"yellow","timed_out":false,"number_of_nodes":
    1,"number_of_data_nodes":1,"active_primary_shards":1,"active_shards":1,"relocati
    ng_shards":0,"initializing_shards":0,"unassigned_shards":1,"delayed_unassigned_s
    hards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_wai
    ting_in_queue_millis":0,"active_shards_percent_as_number":44.51219512195122}
    16:20:38,053 DEBUG [f.p.e.c.f.FsParserAbstract] creating fs crawler thread [test
    ing_tesseract_v5] for [C:\Data_Privacy_GAT\Testing_Tesseract_2] every [15m]
    16:20:38,053 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [testing_
    tesseract_v5] for [C:\Data_Privacy_GAT\Testing_Tesseract_2] every [15m]
    16:20:38,069 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler thread [testing_tesse
    ract_v5] is now running. Run #1...
    16:20:38,069 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\Data_Pr
    ivacy_GAT\Testing_Tesseract_2, C:\Data_Privacy_GAT\Testing_Tesseract_2) = /
    16:20:38,085 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing testing_tesseract_v5_fo
    lder/501a70282ead4e6535ce27023b95d?pipeline=null
    16:20:38,085 TRACE [f.p.e.c.f.FsParserAbstract] JSon indexed : {
      "root" : "95bb7225f7e68f3e099d686fe0a73",
      "virtual" : "/",
      "real" : "C:\\Data_Privacy_GAT\\Testing_Tesseract_2"
    }
    16:20:38,085 DEBUG [f.p.e.c.f.FsParserAbstract] indexing [C:\Data_Privacy_GAT\Te
    sting_Tesseract_2] content
    16:20:38,085 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from C
    :\Data_Privacy_GAT\Testing_Tesseract_2
    16:20:38,100 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped fo
    r file [C:\Data_Privacy_GAT\Testing_Tesseract_2\noisy.pdf] on [windows server 2
    012 r2]
    16:20:38,100 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped fo
    r file [C:\Data_Privacy_GAT\Testing_Tesseract_2\noisy.pdf] on [windows server 2
    012 r2]
    16:20:38,100 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 1 local files found
    16:20:38,100 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstract
    Model{name='noisy.pdf', file=true, directory=false, lastModifiedDate=2020-05-22T
    16:27:57.047518, creationDate=2020-05-28T16:12:41.781162, accessDate=2020-05-28T
    16:12:41.781162, path='C:\Data_Privacy_GAT\Testing_Tesseract_2', owner='IT-DAS\O
    uyang', group='null', permissions=-1, extension='pdf', fullpath='C:\Data_Privacy
    _GAT\Testing_Tesseract_2\noisy.pdf', size=43987}
    16:20:38,100 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\Data_Pr
    ivacy_GAT\Testing_Tesseract_2, C:\Data_Privacy_GAT\Testing_Tesseract_2\noisy.pdf
    ) = /noisy.pdf
    16:20:38,100 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [
    /noisy.pdf], includes = [null], excludes = [[*/~*]]
    16:20:38,100 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/noisy.pdf], excludes
     = [[*/~*]]
    16:20:38,100 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
    16:20:38,100 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude patter
    n
    16:20:38,100 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/noisy.pdf], includes
     = [null]
    16:20:38,100 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
    16:20:38,100 DEBUG [f.p.e.c.f.FsParserAbstract] [/noisy.pdf] can be indexed: [tr
    ue]
    16:20:38,100 DEBUG [f.p.e.c.f.FsParserAbstract]   - file: /noisy.pdf
    16:20:38,100 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [C:\Data_P
    rivacy_GAT\Testing_Tesseract_2],[noisy.pdf]
    16:20:38,100 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(C:\Data_Pr
    ivacy_GAT\Testing_Tesseract_2, C:\Data_Privacy_GAT\Testing_Tesseract_2\noisy.pdf
    ) = /noisy.pdf
    16:20:38,116 TRACE [f.p.e.c.f.t.TikaDocParser] Generating document [C:\Data_Priv
    acy_GAT\Testing_Tesseract_2\noisy.pdf]
    16:20:38,116 TRACE [f.p.e.c.f.t.TikaDocParser] Beginning Tika extraction
    16:20:38,147 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.
    16:20:38,178 DEBUG [f.p.e.c.f.t.TikaInstance] OCR strategy for PDF documents is
    [ocr_and_text] and tesseract was found.
    16:20:38,506 WARN  [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files
     will not be processed.
    See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
    for optional dependencies.

morningkaren · May 28, 2020, 2:33pm

    16:20:38,881 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated so we need to con
    figure Tesseract in case we have specific settings.
    16:20:38,881 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [eng].
    16:20:43,279 TRACE [f.p.e.c.f.t.TikaDocParser] End of Tika extraction
    16:20:44,265 TRACE [f.p.e.c.f.t.TikaDocParser] Main detected language: [en: HIGH
     (0.999994)]
    16:20:47,553 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] Sending a bulk request
    of [1] requests
    16:20:48,875 TRACE [f.p.e.c.f.t.TikaDocParser] Listing all available metadata:
    16:20:48,876 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw.entrySet(), iter
    ableWithSize(42));
    16:20:48,877 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("date"
    , "2020-05-22T14:27:56Z"));
    16:20:48,877 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:u
    nmappedUnicodeCharsPerPage", "0"));
    16:20:48,878 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:P
    DFVersion", "1.7"));
    16:20:48,878 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:d
    ocinfo:title", "tesseract_header.jpg"));
    16:20:48,879 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:h
    asXFA", "false"));
    16:20:48,880 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("acces
    s_permission:modify_annotations", "true"));
    16:20:48,880 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("acces
    s_permission:can_print_degraded", "true"));
    16:20:48,881 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("dc:cr
    eator", "z0045ucs"));
    16:20:48,882 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("dcter
    ms:created", "2020-05-22T14:27:56Z"));
    16:20:48,882 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("Last-
    Modified", "2020-05-22T14:27:56Z"));
    16:20:48,883 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("dcter
    ms:modified", "2020-05-22T14:27:56Z"));
    16:20:48,884 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("dc:fo
    rmat", "application/pdf; version=1.7"));
    16:20:48,884 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("title
    ", "tesseract_header.jpg"));
    16:20:48,885 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("Last-
    Save-Date", "2020-05-22T14:27:56Z"));
    16:20:48,885 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("acces
    s_permission:fill_in_form", "true"));
    16:20:48,886 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:d
    ocinfo:modified", "2020-05-22T14:27:56Z"));
    16:20:48,887 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("meta:
    save-date", "2020-05-22T14:27:56Z"));
    16:20:48,887 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:e
    ncrypted", "false"));
    16:20:48,888 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("dc:ti
    tle", "tesseract_header.jpg"));
    16:20:48,888 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("modif
    ied", "2020-05-22T14:27:56Z"));
    16:20:48,889 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:h
    asMarkedContent", "false"));
    16:20:48,890 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("Conte
    nt-Type", "application/pdf"));
    16:20:48,890 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:d
    ocinfo:creator", "z0045ucs"));
    16:20:48,891 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("X-Par
    sed-By", "org.apache.tika.parser.pdf.PDFParser"));
    16:20:48,892 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("creat
    or", "z0045ucs"));
    16:20:48,893 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("meta:
    author", "z0045ucs"));
    16:20:48,894 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("meta:
    creation-date", "2020-05-22T14:27:56Z"));
    16:20:48,895 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("creat
    ed", "2020-05-22T14:27:56Z"));
    16:20:48,895 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("acces
    s_permission:extract_for_accessibility", "true"));
    16:20:48,896 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("acces
    s_permission:assemble_document", "true"));
    16:20:48,897 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("xmpTP
    g:NPages", "1"));
    16:20:48,898 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("Creat
    ion-Date", "2020-05-22T14:27:56Z"));
    16:20:48,899 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("resou
    rceName", "noisy.pdf"));
    16:20:48,900 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:h
    asXMP", "false"));
    16:20:48,900 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:c
    harsPerPage", "0"));
    16:20:48,901 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("acces
    s_permission:extract_content", "true"));
    16:20:48,902 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("acces
    s_permission:can_print", "true"));
    16:20:48,903 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("Autho
    r", "z0045ucs"));
    16:20:48,903 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("produ
    cer", "Microsoft: Print To PDF"));
    16:20:48,904 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("acces
    s_permission:can_modify", "true"));
    16:20:48,905 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:d
    ocinfo:producer", "Microsoft: Print To PDF"));
    16:20:48,906 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:d
    ocinfo:created", "2020-05-22T14:27:56Z"));
    16:20:48,907 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
    16:20:48,908 TRACE [f.p.e.c.f.f.FsCrawlerUtil] No pattern always matches.
    16:20:48,913 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] Executed bulk request w
    ith [1] requests
    16:20:48,928 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing testing_tesseract_v5/no
    isy.pdf?pipeline=null
    16:20:48,928 TRACE [f.p.e.c.f.FsParserAbstract] JSon indexed : {
      "content" : "\n  \n\nCcv)malcolm:tesseract-python adrianrosebrock$ python ocr.
    py --image images/example_@1.png\nNoisy image\nto test\nOCS 1 1yeln edd\n\n
     \n   \n  \n\nNoisy image\nto test\nTesseract OCR\n\n  \n  \n\n\n",
      "meta" : {
        "author" : "z0045ucs",
        "title" : "tesseract_header.jpg",
        "date" : "2020-05-22T12:27:56.000+00:00",
        "language" : "en",
        "format" : "application/pdf; version=1.7",
        "created" : "2020-05-22T12:27:56.000+00:00",
        "raw" : {
          "date" : "2020-05-22T14:27:56Z",
          "pdf:unmappedUnicodeCharsPerPage" : "0",
          "pdf:PDFVersion" : "1.7",
          "pdf:docinfo:title" : "tesseract_header.jpg",
          "pdf:hasXFA" : "false",
          "access_permission:modify_annotations" : "true",
          "access_permission:can_print_degraded" : "true",
          "dc:creator" : "z0045ucs",
          "dcterms:created" : "2020-05-22T14:27:56Z",
          "Last-Modified" : "2020-05-22T14:27:56Z",
          "dcterms:modified" : "2020-05-22T14:27:56Z",
          "dc:format" : "application/pdf; version=1.7",
          "title" : "tesseract_header.jpg",
          "Last-Save-Date" : "2020-05-22T14:27:56Z",
          "access_permission:fill_in_form" : "true",
          "pdf:docinfo:modified" : "2020-05-22T14:27:56Z",
          "meta:save-date" : "2020-05-22T14:27:56Z",
          "pdf:encrypted" : "false",
          "dc:title" : "tesseract_header.jpg",
          "modified" : "2020-05-22T14:27:56Z",
          "pdf:hasMarkedContent" : "false",
          "Content-Type" : "application/pdf",
          "pdf:docinfo:creator" : "z0045ucs",
          "X-Parsed-By" : "org.apache.tika.parser.pdf.PDFParser",
          "creator" : "z0045ucs",
          "meta:author" : "z0045ucs",
          "meta:creation-date" : "2020-05-22T14:27:56Z",
          "created" : "2020-05-22T14:27:56Z",
          "access_permission:extract_for_accessibility" : "true",
          "access_permission:assemble_document" : "true",
          "xmpTPg:NPages" : "1",
          "Creation-Date" : "2020-05-22T14:27:56Z",
          "resourceName" : "noisy.pdf",
          "pdf:hasXMP" : "false",
          "pdf:charsPerPage" : "0",
          "access_permission:extract_content" : "true",
          "access_permission:can_print" : "true",
          "Author" : "z0045ucs",
          "producer" : "Microsoft: Print To PDF",
          "access_permission:can_modify" : "true",
          "pdf:docinfo:producer" : "Microsoft: Print To PDF",
          "pdf:docinfo:created" : "2020-05-22T14:27:56Z"
        }
      },
      "file" : {
        "extension" : "pdf",
        "content_type" : "application/pdf",
        "created" : "2020-05-28T14:12:41.781+00:00",
        "last_modified" : "2020-05-22T14:27:57.047+00:00",
        "last_accessed" : "2020-05-28T14:12:41.781+00:00",
        "indexing_date" : "2020-05-28T14:20:38.100+00:00",
        "filesize" : 43987,
        "filename" : "noisy.pdf",
        "url" : "file://C:\\Data_Privacy_GAT\\Testing_Tesseract_2\\noisy.pdf"
      },
      "path" : {
        "root" : "501a70282ead4e6535ce27023b95d",
        "virtual" : "/noisy.pdf",
        "real" : "C:\\Data_Privacy_GAT\\Testing_Tesseract_2\\noisy.pdf"
      }
    }
    16:20:48,949 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [C:
    \Data_Privacy_GAT\Testing_Tesseract_2]...
    16:20:48,950 TRACE [f.p.e.c.f.FsParserAbstract] Querying elasticsearch for files
     in dir [path.root:501a70282ead4e6535ce27023b95d]
    16:20:48,986 TRACE [f.p.e.c.f.FsParserAbstract] Response [fr.pilato.elasticsearc
    h.crawler.fs.client.ESSearchResponse@592b622c]
    16:20:48,987 TRACE [f.p.e.c.f.FsParserAbstract] We found: []
    16:20:48,987 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed directories
    in [C:\Data_Privacy_GAT\Testing_Tesseract_2]...
    16:20:49,001 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler is going to sleep for
     15m
    16:20:58,884 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] Sending a bulk request
    of [1] requests
    16:20:58,942 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] Executed bulk request w
    ith [1] requests

Thanks for your help.

dadoonet · May 28, 2020, 3:54pm

I can see some content extracted:

\n \n\nCcv)malcolm:tesseract-python adrianrosebrock$ python ocr.
py --image images/example_@1.png\nNoisy image\nto test\nOCS 1 1yeln edd\n\n
\n \n \n\nNoisy image\nto test\nTesseract OCR\n\n \n \n\n\n

Isn't that what you are looking for? Could you share the noisy.pdf PDF document?
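A quick way to tell real OCR output from a field that holds only line breaks is to strip the whitespace and look at what is left. The sketch below is only an illustration: `has_visible_text`, `visible_lines` and the `min_chars` threshold are made-up names and values, and the sample string is a shortened excerpt of the `content` field from the trace.

```python
def has_visible_text(content, min_chars=10):
    """Return True if content holds at least min_chars non-whitespace characters."""
    visible = "".join(content.split())  # drop spaces, tabs and line breaks
    return len(visible) >= min_chars


def visible_lines(content):
    """Return the non-empty lines of content, stripped of surrounding whitespace."""
    return [line.strip() for line in content.splitlines() if line.strip()]


# Shortened excerpt of the "content" field shown in the trace.
sample = "\n  \n\nNoisy image\nto test\nTesseract OCR\n\n  \n  \n\n\n"
# What an OCR failure would look like: only whitespace and line breaks.
only_newlines = "\n  \n\n\n"

if __name__ == "__main__":
    print(has_visible_text(sample), visible_lines(sample))
    print(has_visible_text(only_newlines), visible_lines(only_newlines))
```

In Kibana the leading and trailing line breaks can make a document look empty even when the text is present in the middle of the field.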

morningkaren · May 28, 2020, 4:42pm

Hi David,

Yeah, you are right. The content was extracted! I don't understand why it didn't extract the text from my other directory though.

I created a new directory like you asked with 1 file and it worked.

But when I added a new file to the directory, for some reason it is not showing up or getting processed. I still have just 1 file in Kibana right now, and it's been about 30 minutes since I put the new pdf into the directory.

Do you know what the problem is with that?

Thanks,

Karen

dadoonet · May 28, 2020, 4:59pm

If you are moving a file from one directory to another, it's likely that its modification date is older than the last run date of FSCrawler. In that case the file is not considered new.

You can run fscrawler with the --restart option. It will simply ignore all file dates and will reindex everything.
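The date comparison described above can be modelled roughly as follows. This is a sketch of the idea, not FSCrawler's actual code: `is_candidate` and `demo` are hypothetical names, and treating a restart as a last run date of zero is an assumption made for illustration.

```python
import os
import tempfile
import time
from pathlib import Path


def is_candidate(path, last_run):
    """Return True if the file was modified after the last run (epoch seconds)."""
    return path.stat().st_mtime > last_run


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        last_run = time.time()
        moved = Path(tmp) / "moved.pdf"
        moved.write_bytes(b"%PDF-1.7")
        # Simulate a moved file: its modification date predates the last run.
        one_hour_ago = last_run - 3600
        os.utime(moved, (one_hour_ago, one_hour_ago))
        skipped = not is_candidate(moved, last_run)
        # Assumption: a restart forgets the last run date, modelled here as 0.
        picked_up_after_restart = is_candidate(moved, 0.0)
        return skipped, picked_up_after_restart


if __name__ == "__main__":
    print(demo())
```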

morningkaren · May 28, 2020, 5:23pm

I see. I did just move a file from another directory. When I copied and pasted the file instead, fscrawler was able to process it.

Thanks for your help, David.

dadoonet · May 28, 2020, 5:39pm

If you're running on Linux you can also "touch" the file to make it appear more recent than it is.
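On Windows, where `touch` is not available by default, a few lines of Python can do the same thing. This is a minimal sketch: `refresh_mtime` and `demo` are hypothetical helper names, and the file used here is a throwaway one created in a temporary directory.

```python
import os
import tempfile
import time
from pathlib import Path


def refresh_mtime(path):
    """Set the access and modification times of path to now, like `touch`."""
    os.utime(path, None)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        pdf = Path(tmp) / "noisy.pdf"
        pdf.write_bytes(b"%PDF-1.7")
        # Give the file an old modification date, as a moved file would have.
        one_hour_ago = time.time() - 3600
        os.utime(pdf, (one_hour_ago, one_hour_ago))
        before = pdf.stat().st_mtime
        refresh_mtime(pdf)
        after = pdf.stat().st_mtime
        return after > before


if __name__ == "__main__":
    print(demo())
```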

system · June 25, 2020, 5:39pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
