Hello,
I'm trying to index a TIF file using FSCrawler but don't get any content, while PDF works fine.
Please assist.
Welcome!
Do you have OCR installed?
What are the full logs when you run with the --trace option?
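For reference, a full-trace run from the FSCrawler distribution directory on Windows looks something like this (a sketch: the install path and job name are placeholders, and --loop 1 just makes the crawler do a single pass and exit):
C:\fscrawler\bin>fscrawler.bat job_name --trace --loop 1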
09:34:37,036 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.
09:34:37,042 DEBUG [f.p.e.c.f.t.TikaInstance] But Tesseract is not installed so we won't run OCR.
09:34:37,312 WARN [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
09:34:37,634 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated so we need to configure Tesseract in case we have specific settings.
09:34:37,635 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [eng].
09:34:37,713 TRACE [f.p.e.c.f.t.TikaDocParser] End of Tika extraction
09:34:37,720 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
09:34:37,720 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Null or empty content always matches.
09:34:37,731 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing tiftest/a3c5b9966776fd4661a0967eb22e99e?pipeline=null
Can you refer me to the proper documentation, please?
I asked for the full logs. Is it possible to get them?
Please don't post unformatted code, logs, or configuration, as they're very hard to read.
Instead, paste the text and format it with the </> icon or pairs of triple backticks (```), and check the preview window to make sure it's properly formatted before posting. This makes it more likely that your question will receive a useful answer.
Hi David,
I have attached the FSCrawler trace output for a run over one PDF and one TIF file.
The content for both files comes back empty.
Thank you in advance,
Avishai -
13:38:04,484 INFO [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [1.8gb/26.6gb=6.85%], RAM [173.1gb/195.9gb=88.33%], Swap [196gb/223.9gb=87.51%].
13:38:04,500 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
13:38:04,500 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
13:38:04,500 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings.json] already exists
13:38:04,500 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings_folder.json] already exists
13:38:04,500 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [test0518]...
13:38:04,844 TRACE [f.p.e.c.f.c.FsCrawlerCli] settings used for this crawler: [---
name: "test0518"
fs:
url: "\\tmp\\stg"
update_rate: "15m"
excludes:
- "*/~*"
json_support: false
filename_as_id: false
add_filesize: true
remove_deleted: true
add_as_inner_object: false
store_source: false
index_content: true
attributes_support: false
raw_metadata: false
xml_support: false
index_folders: true
lang_detect: false
continue_on_error: false
ocr:
language: "heb"
path: "/Program Files/Tesseract-OCR"
data_path: "/Program Files/Tesseract-OCR/tessdata"
enabled: true
pdf_strategy: "ocr_and_text"
follow_symlinks: false
elasticsearch:
nodes:
- url: "http://127.0.0.1:9200"
bulk_size: 100
flush_interval: "5s"
byte_size: "10mb"
]
13:38:05,797 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped for file [\tmp\stg\00102286.TIF] on [windows server 2016]
13:38:05,797 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped for file [\tmp\stg\00102286.TIF] on [windows server 2016]
13:38:05,797 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped for file [\tmp\stg\15857372.PDF] on [windows server 2016]
13:38:05,797 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped for file [\tmp\stg\15857372.PDF] on [windows server 2016]
13:38:05,797 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 2 local files found
13:38:05,797 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstractModel{name='00102286.TIF', file=true, directory=false, lastModifiedDate=2004-01-13T16:55:24, creationDate=2020-05-17T15:18:06.590180, accessDate=2020-05-17T15:18:06.590180, path='\tmp\stg', owner='Me\***', group='null', permissions=-1, extension='tif', fullpath='C:\tmp\stg\00102286.TIF', size=220752}
13:38:05,797 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg\00102286.TIF) = /00102286.TIF
13:38:05,797 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/00102286.TIF], includes = [null], excludes = [[*/~*]]
13:38:05,797 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/00102286.TIF], excludes = [[*/~*]]
13:38:05,797 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
13:38:05,797 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
13:38:05,797 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/00102286.TIF], includes = [null]
13:38:05,797 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
13:38:05,797 DEBUG [f.p.e.c.f.FsParserAbstract] [/00102286.TIF] can be indexed: [true]
13:38:05,813 DEBUG [f.p.e.c.f.FsParserAbstract] - file: /00102286.TIF
13:38:05,813 DEBUG [f.p.e.c.f.FsParserAbstract] - not modified: creation date 2020-05-17T15:18:06.590180 , file date 2004-01-13T16:55:24, last scan date 2020-05-18T13:36:02.345
13:38:05,813 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstractModel{name='15857372.PDF', file=true, directory=false, lastModifiedDate=2019-01-23T19:32:37.677090, creationDate=2020-05-17T15:55:14.183595, accessDate=2020-05-17T15:55:14.183595, path='\tmp\stg', owner='Me\***', group='null', permissions=-1, extension='pdf', fullpath='C:\tmp\stg\15857372.PDF', size=608949}
13:38:05,813 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg\15857372.PDF) = /15857372.PDF
13:38:05,813 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/15857372.PDF], includes = [null], excludes = [[*/~*]]
13:38:05,813 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/15857372.PDF], excludes = [[*/~*]]
13:38:05,813 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
13:38:05,813 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
13:38:05,813 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/15857372.PDF], includes = [null]
13:38:05,813 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
13:38:05,813 DEBUG [f.p.e.c.f.FsParserAbstract] [/15857372.PDF] can be indexed: [true]
13:38:05,813 DEBUG [f.p.e.c.f.FsParserAbstract] - file: /15857372.PDF
13:38:05,829 DEBUG [f.p.e.c.f.FsParserAbstract] - not modified: creation date 2020-05-17T15:55:14.183595 , file date 2019-01-23T19:32:37.677090, last scan date 2020-05-18T13:36:02.345
13:38:05,829 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [\tmp\stg]...
13:38:05,829 TRACE [f.p.e.c.f.FsParserAbstract] Querying elasticsearch for files in dir [path.root:6c7bd4f3b29617bb2da3d3ffdbdaf7]
13:38:05,876 TRACE [f.p.e.c.f.FsParserAbstract] Response [fr.pilato.elasticsearch.crawler.fs.client.ESSearchResponse@5948b091]
13:38:05,876 TRACE [f.p.e.c.f.FsParserAbstract] We found: [00102286.TIF, 15857372.PDF]
13:38:05,876 TRACE [f.p.e.c.f.FsParserAbstract] Checking file [00102286.TIF]
13:38:05,876 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg\00102286.TIF) = /00102286.TIF
13:38:05,891 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/00102286.TIF], includes = [null], excludes = [[*/~*]]
13:38:05,891 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/00102286.TIF], excludes = [[*/~*]]
13:38:05,891 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
13:38:05,891 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
13:38:05,891 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/00102286.TIF], includes = [null]
13:38:05,891 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
13:38:05,891 TRACE [f.p.e.c.f.FsParserAbstract] Checking file [15857372.PDF]
13:38:05,891 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg\15857372.PDF) = /15857372.PDF
13:38:05,891 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/15857372.PDF], includes = [null], excludes = [[*/~*]]
13:38:05,891 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/15857372.PDF], excludes = [[*/~*]]
13:38:05,891 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
13:38:05,891 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
13:38:05,891 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/15857372.PDF], includes = [null]
13:38:05,891 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
13:38:05,891 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed directories in [\tmp\stg]...
13:38:05,907 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler is going to sleep for 15m
13:53:05,967 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler is now waking up again...
13:53:05,967 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler thread [test0518] is now running. Run #2...
13:53:05,967 DEBUG [f.p.e.c.f.FsParserAbstract] indexing [\tmp\stg] content
13:53:05,967 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from \tmp\stg
13:53:05,967 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped for file [\tmp\stg\00102286.TIF] on [windows server 2016]
13:53:05,967 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped for file [\tmp\stg\00102286.TIF] on [windows server 2016]
13:53:05,967 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped for file [\tmp\stg\15857372.PDF] on [windows server 2016]
13:53:05,967 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped for file [\tmp\stg\15857372.PDF] on [windows server 2016]
13:53:05,967 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 2 local files found
13:53:05,967 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstractModel{name='00102286.TIF', file=true, directory=false, lastModifiedDate=2004-01-13T16:55:24, creationDate=2020-05-17T15:18:06.590180, accessDate=2020-05-17T15:18:06.590180, path='\tmp\stg', owner='Me\***', group='null', permissions=-1, extension='tif', fullpath='C:\tmp\stg\00102286.TIF', size=220752}
13:53:05,983 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg\00102286.TIF) = /00102286.TIF
13:53:05,983 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/00102286.TIF], includes = [null], excludes = [[*/~*]]
13:53:05,983 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/00102286.TIF], excludes = [[*/~*]]
13:53:05,983 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
13:53:05,983 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
13:53:05,983 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/00102286.TIF], includes = [null]
13:53:05,983 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
13:53:05,983 DEBUG [f.p.e.c.f.FsParserAbstract] [/00102286.TIF] can be indexed: [true]
13:53:05,983 DEBUG [f.p.e.c.f.FsParserAbstract] - file: /00102286.TIF
13:53:05,983 DEBUG [f.p.e.c.f.FsParserAbstract] - not modified: creation date 2020-05-17T15:18:06.590180 , file date 2004-01-13T16:55:24, last scan date 2020-05-18T13:38:03.782
13:53:05,983 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstractModel{name='15857372.PDF', file=true, directory=false, lastModifiedDate=2019-01-23T19:32:37.677090, creationDate=2020-05-17T15:55:14.183595, accessDate=2020-05-17T15:55:14.183595, path='\tmp\stg', owner='Me\***', group='null', permissions=-1, extension='pdf', fullpath='C:\tmp\stg\15857372.PDF', size=608949}
13:53:05,983 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg\15857372.PDF) = /15857372.PDF
13:53:05,983 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/15857372.PDF], includes = [null], excludes = [[*/~*]]
13:53:05,983 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/15857372.PDF], excludes = [[*/~*]]
13:53:05,983 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
13:53:05,983 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
13:53:05,983 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/15857372.PDF], includes = [null]
13:53:05,983 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
13:53:05,999 DEBUG [f.p.e.c.f.FsParserAbstract] [/15857372.PDF] can be indexed: [true]
13:53:05,999 DEBUG [f.p.e.c.f.FsParserAbstract] - file: /15857372.PDF
13:53:05,999 DEBUG [f.p.e.c.f.FsParserAbstract] - not modified: creation date 2020-05-17T15:55:14.183595 , file date 2019-01-23T19:32:37.677090, last scan date 2020-05-18T13:38:03.782
13:53:05,999 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [\tmp\stg]...
13:53:05,999 TRACE [f.p.e.c.f.FsParserAbstract] Querying elasticsearch for files in dir [path.root:6c7bd4f3b29617bb2da3d3ffdbdaf7]
13:53:05,999 TRACE [f.p.e.c.f.FsParserAbstract] Response [fr.pilato.elasticsearch.crawler.fs.client.ESSearchResponse@37fe5ddb]
13:53:05,999 TRACE [f.p.e.c.f.FsParserAbstract] We found: [00102286.TIF, 15857372.PDF]
13:53:05,999 TRACE [f.p.e.c.f.FsParserAbstract] Checking file [00102286.TIF]
13:53:05,999 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg\00102286.TIF) = /00102286.TIF
13:53:05,999 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/00102286.TIF], includes = [null], excludes = [[*/~*]]
13:53:05,999 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/00102286.TIF], excludes = [[*/~*]]
13:53:05,999 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
13:53:06,014 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
13:53:06,014 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/00102286.TIF], includes = [null]
13:53:06,014 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
13:53:06,014 TRACE [f.p.e.c.f.FsParserAbstract] Checking file [15857372.PDF]
13:53:06,014 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg\15857372.PDF) = /15857372.PDF
13:53:06,014 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/15857372.PDF], includes = [null], excludes = [[*/~*]]
13:53:06,014 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/15857372.PDF], excludes = [[*/~*]]
13:53:06,014 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
13:53:06,014 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
13:53:06,014 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/15857372.PDF], includes = [null]
13:53:06,014 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
13:53:06,014 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed directories in [\tmp\stg]...
13:53:06,030 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler is going to sleep for 15m
Could you do the same thing but with the --restart option, since FSCrawler checked the dates here and did not find any new files?
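In other words, something along these lines (same sketch as before; --restart clears the stored last-run status so every file is picked up again on the next pass):
C:\fscrawler\bin>fscrawler.bat test0518 --trace --loop 1 --restart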
It helped with the TIF file but not with the PDF one.
Here is the output -
15:04:38,969 INFO [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [1.8gb/26.6gb=6.85%], RAM [173.1gb/195.9gb=88.32%], Swap [196gb/223.9gb=87.5%].
15:04:38,985 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
15:04:38,986 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
15:04:38,986 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings.json] already exists
15:04:38,986 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings_folder.json] already exists
15:04:38,986 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Cleaning existing status for job [test0518]...
15:04:38,986 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [test0518]...
15:04:39,328 TRACE [f.p.e.c.f.c.FsCrawlerCli] settings used for this crawler: [---
name: "test0518"
fs:
url: "\\tmp\\stg"
update_rate: "15m"
excludes:
- "*/~*"
json_support: false
filename_as_id: false
add_filesize: true
remove_deleted: true
add_as_inner_object: false
store_source: false
index_content: true
attributes_support: false
raw_metadata: false
xml_support: false
index_folders: true
lang_detect: false
continue_on_error: false
ocr:
language: "heb"
path: "/Program Files/Tesseract-OCR"
data_path: "/Program Files/Tesseract-OCR/tessdata"
enabled: true
pdf_strategy: "ocr_and_text"
follow_symlinks: false
elasticsearch:
nodes:
- url: "http://127.0.0.1:9200"
bulk_size: 100
flush_interval: "5s"
byte_size: "10mb"
]
15:04:40,235 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] create index [test0518_folder]
15:04:40,235 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] index settings: [{
"settings": {
"analysis": {
"analyzer": {
"fscrawler_path": {
"tokenizer": "fscrawler_path"
}
},
"tokenizer": {
"fscrawler_path": {
"type": "path_hierarchy"
}
}
}
},
"mappings": {
"properties" : {
"real" : {
"type" : "keyword",
"store" : true
},
"root" : {
"type" : "keyword",
"store" : true
},
"virtual" : {
"type" : "keyword",
"store" : true
}
}
}
}
]
15:04:40,235 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] wait for yellow health on index [test0518_folder]
15:04:40,250 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] health response: {"cluster_name":"Me","status":"yellow","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":1,"active_shards":1,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":1,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":54.166666666666664}
15:04:40,250 DEBUG [f.p.e.c.f.FsParserAbstract] creating fs crawler thread [test0518] for [\tmp\stg] every [15m]
15:04:40,250 INFO [f.p.e.c.f.FsParserAbstract] FS crawler started for [test0518] for [\tmp\stg] every [15m]
15:04:40,250 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler thread [test0518] is now running. Run #1...
15:04:40,266 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg) = /
15:04:40,266 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing test0518_folder/6c7bd4f3b29617bb2da3d3ffdbdaf7?pipeline=null
15:04:40,266 TRACE [f.p.e.c.f.FsParserAbstract] JSon indexed : {
"root" : "a1ba3d554a8a89c16d758b29eaff9953",
"virtual" : "/",
"real" : "\\tmp\\stg"
}
15:04:40,266 DEBUG [f.p.e.c.f.FsParserAbstract] indexing [\tmp\stg] content
15:04:40,266 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from \tmp\stg
15:04:40,281 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped for file [\tmp\stg\00102286.TIF] on [windows server 2016]
15:04:40,281 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped for file [\tmp\stg\00102286.TIF] on [windows server 2016]
15:04:40,281 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped for file [\tmp\stg\15857372.PDF] on [windows server 2016]
15:04:40,281 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped for file [\tmp\stg\15857372.PDF] on [windows server 2016]
15:04:40,281 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 2 local files found
15:04:40,281 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstractModel{name='00102286.TIF', file=true, directory=false, lastModifiedDate=2004-01-13T16:55:24, creationDate=2020-05-17T15:18:06.590180, accessDate=2020-05-17T15:18:06.590180, path='\tmp\stg', owner='Me\***', group='null', permissions=-1, extension='tif', fullpath='C:\tmp\stg\00102286.TIF', size=220752}
15:04:40,281 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg\00102286.TIF) = /00102286.TIF
15:04:40,281 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/00102286.TIF], includes = [null], excludes = [[*/~*]]
15:04:40,281 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/00102286.TIF], excludes = [[*/~*]]
15:04:40,281 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
15:04:40,281 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
15:04:40,281 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/00102286.TIF], includes = [null]
15:04:40,281 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
15:04:40,281 DEBUG [f.p.e.c.f.FsParserAbstract] [/00102286.TIF] can be indexed: [true]
15:04:40,281 DEBUG [f.p.e.c.f.FsParserAbstract] - file: /00102286.TIF
15:04:40,297 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [\tmp\stg],[00102286.TIF]
15:04:40,303 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg\00102286.TIF) = /00102286.TIF
15:04:40,305 TRACE [f.p.e.c.f.t.TikaDocParser] Generating document [\tmp\stg\00102286.TIF]
15:04:40,312 TRACE [f.p.e.c.f.t.TikaDocParser] Beginning Tika extraction
15:04:40,312 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.
15:04:40,328 DEBUG [f.p.e.c.f.t.TikaInstance] But Tesseract is not installed so we won't run OCR.
15:04:40,575 WARN [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
15:04:40,859 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated so we need to configure Tesseract in case we have specific settings.
15:04:40,859 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Path set to [/Program Files/Tesseract-OCR].
15:04:40,859 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Data Path set to [/Program Files/Tesseract-OCR/tessdata].
15:04:40,859 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [heb].
15:04:45,031 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] Sending a bulk request of [1] requests
15:04:45,047 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] Executed bulk request with [1] requests
15:04:59,687 TRACE [f.p.e.c.f.t.TikaDocParser] End of Tika extraction
15:04:59,708 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
15:04:59,708 TRACE [f.p.e.c.f.f.FsCrawlerUtil] No pattern always matches.
15:04:59,721 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing test0518/7d7e5f4becfde4f8741314423b05667?pipeline=null
15:04:59,721 TRACE [f.p.e.c.f.FsParserAbstract] JSon indexed : {
"content" : "╫ס\n\n8...",
"meta" : { },
"file" : {
"extension" : "tif",
"content_type" : "image/tiff",
"created" : "2020-05-17T12:18:06.590+0000",
"last_modified" : "2004-01-13T14:55:24.000+0000",
"last_accessed" : "2020-05-17T12:18:06.590+0000",
"indexing_date" : "2020-05-18T12:04:40.303+0000",
"filesize" : 220752,
"filename" : "00102286.TIF",
"url" : "file://\\tmp\\stg\\00102286.TIF"
},
"path" : {
"root" : "6c7bd4f3b29617bb2da3d3ffdbdaf7",
"virtual" : "/00102286.TIF",
"real" : "\\tmp\\stg\\00102286.TIF"
}
}
15:04:59,728 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstractModel{name='15857372.PDF', file=true, directory=false, lastModifiedDate=2019-01-23T19:32:37.677090, creationDate=2020-05-17T15:55:14.183595, accessDate=2020-05-17T15:55:14.183595, path='\tmp\stg', owner='Me\***', group='null', permissions=-1, extension='pdf', fullpath='C:\tmp\stg\15857372.PDF', size=608949}
15:04:59,729 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg\15857372.PDF) = /15857372.PDF
15:04:59,730 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/15857372.PDF], includes = [null], excludes = [[*/~*]]
15:04:59,731 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/15857372.PDF], excludes = [[*/~*]]
15:04:59,732 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
15:04:59,733 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
15:04:59,749 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/15857372.PDF], includes = [null]
15:04:59,749 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
15:04:59,749 DEBUG [f.p.e.c.f.FsParserAbstract] [/15857372.PDF] can be indexed: [true]
15:04:59,749 DEBUG [f.p.e.c.f.FsParserAbstract] - file: /15857372.PDF
15:04:59,749 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [\tmp\stg],[15857372.PDF]
15:04:59,749 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg\15857372.PDF) = /15857372.PDF
15:04:59,749 TRACE [f.p.e.c.f.t.TikaDocParser] Generating document [\tmp\stg\15857372.PDF]
15:04:59,749 TRACE [f.p.e.c.f.t.TikaDocParser] Beginning Tika extraction
15:05:00,015 TRACE [f.p.e.c.f.t.TikaDocParser] End of Tika extraction
15:05:00,015 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
15:05:00,015 TRACE [f.p.e.c.f.f.FsCrawlerUtil] No pattern always matches.
15:05:00,015 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing test0518/b2fed9ec73554588e881dfa47e1404c?pipeline=null
15:05:00,015 TRACE [f.p.e.c.f.FsParserAbstract] JSon indexed : {
"content" : "\n\n\n\n",
"meta" : {
"format" : "application/pdf; version=1.3"
},
"file" : {
"extension" : "pdf",
"content_type" : "application/pdf",
"created" : "2020-05-17T12:55:14.183+0000",
"last_modified" : "2019-01-23T17:32:37.677+0000",
"last_accessed" : "2020-05-17T12:55:14.183+0000",
"indexing_date" : "2020-05-18T12:04:59.749+0000",
"filesize" : 608949,
"filename" : "15857372.PDF",
"url" : "file://\\tmp\\stg\\15857372.PDF"
},
"path" : {
"root" : "6c7bd4f3b29617bb2da3d3ffdbdaf7",
"virtual" : "/15857372.PDF",
"real" : "\\tmp\\stg\\15857372.PDF"
}
}
15:05:00,015 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [\tmp\stg]...
15:05:00,015 TRACE [f.p.e.c.f.FsParserAbstract] Querying elasticsearch for files in dir [path.root:6c7bd4f3b29617bb2da3d3ffdbdaf7]
15:05:00,062 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] Sending a bulk request of [2] requests
15:05:00,078 TRACE [f.p.e.c.f.FsParserAbstract] Response [fr.pilato.elasticsearch.crawler.fs.client.ESSearchResponse@8dbe287]
15:05:00,078 TRACE [f.p.e.c.f.FsParserAbstract] We found: [00102286.TIF, 15857372.PDF]
15:05:00,078 TRACE [f.p.e.c.f.FsParserAbstract] Checking file [00102286.TIF]
15:05:00,078 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg\00102286.TIF) = /00102286.TIF
15:05:00,078 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/00102286.TIF], includes = [null], excludes = [[*/~*]]
15:05:00,078 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/00102286.TIF], excludes = [[*/~*]]
15:05:00,078 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
15:05:00,078 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
15:05:00,078 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/00102286.TIF], includes = [null]
15:05:00,078 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] Executed bulk request with [2] requests
15:05:00,078 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
15:05:00,093 TRACE [f.p.e.c.f.FsParserAbstract] Checking file [15857372.PDF]
15:05:00,093 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg\15857372.PDF) = /15857372.PDF
15:05:00,093 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/15857372.PDF], includes = [null], excludes = [[*/~*]]
15:05:00,093 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/15857372.PDF], excludes = [[*/~*]]
15:05:00,093 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
15:05:00,093 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
15:05:00,093 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/15857372.PDF], includes = [null]
15:05:00,093 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
15:05:00,093 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed directories in [\tmp\stg]...
15:05:00,812 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler is going to sleep for 15m
I'm confused by this message:
15:04:40,328 DEBUG [f.p.e.c.f.t.TikaInstance] But Tesseract is not installed so we won't run OCR.
I think that this might be incorrect:
path: "/Program Files/Tesseract-OCR"
I think it should be something like:
path: "/Program Files/Tesseract-OCR/tesseract.exe"
That didn't work; I'm still getting "But Tesseract is not installed so we won't run OCR".
Any other idea why?
When I try executing Tesseract by itself (not through FSCrawler) I get the error below.
Perhaps Tesseract doesn't support OCR on PDFs?
C:\Tesseract-OCR>tesseract 15857372.PDF out
Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
Error in pixReadStream: Pdf reading is not supported
Error in pixRead: pix not read
Error during processing.
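As a sanity check it may be worth pointing Tesseract at the TIF instead (paths are illustrative; -l heb matches the language configured in the job), since Leptonica reads images such as TIFF but not PDFs:
C:\Tesseract-OCR>tesseract C:\tmp\stg\00102286.TIF out -l heb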
Maybe try with a directory name without a space in it?
Is FSCrawler running on the same drive?
Yes, it is on the same drive, and I also tried it outside the Program Files folder - it still didn't work.
I guess Tesseract doesn't support PDFs (see the error I attached in my previous post).
https://coptr.digipres.org/Tesseract-ocr
"
Any image readable by Leptonica is supported in Tesseract including BMP, PNM, PNG, JFIF, JPEG, and TIFF
"
Tesseract does not support PDF. But Tika actually extracts the images from the PDF and sends them to Tesseract.
It works on my laptop at least.
Could you share with me a PDF file so I can test extraction locally?
Interesting... so it should work.
Unfortunately those documents are confidential, so I can't share them - is there any other information I can pass along?
Could you try with this document?
This one is good - the text is indexed.
My other PDFs still aren't, though.
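For reference, one way to check whether the content field actually got populated for a given file is a query along these lines (a sketch; curl is just one possible client, and the index name test0518 comes from the job settings shown earlier):
curl "http://127.0.0.1:9200/test0518/_search?pretty&q=file.filename:15857372.PDF&_source=file.filename,content"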
I can't help more without a concrete example. If you could find a similar document which is not classified and share it, that would help.
At least we can see that OCR seems to be well configured.
There is no way to attach a PDF here.
I have an example; how can I attach it? (Authorized extensions: jpg, jpeg, png, gif.)
Use another binary upload site of your choice, or Dropbox, Box, Google Drive...
There it is -
Hi David,
Any luck?