Hello,
I'm trying to index a TIF file using FSCrawler but don't get any content, while PDF works fine.
Please assist.
Welcome!
Do you have OCR installed?
What are the full logs when you run with the --trace option?
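For reference, a full-trace run from the FSCrawler distribution directory on Windows looks something like this (a sketch: the install path and job name are placeholders, and --loop 1 just makes the crawler do a single pass and exit):
C:\fscrawler\bin>fscrawler.bat job_name --trace --loop 1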
09:34:37,036 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.
09:34:37,042 DEBUG [f.p.e.c.f.t.TikaInstance] But Tesseract is not installed so we won't run OCR.
09:34:37,312 WARN [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
09:34:37,634 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated so we need to configure Tesseract in case we have specific settings.
09:34:37,635 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [eng].
09:34:37,713 TRACE [f.p.e.c.f.t.TikaDocParser] End of Tika extraction
09:34:37,720 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
09:34:37,720 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Null or empty content always matches.
09:34:37,731 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing tiftest/a3c5b9966776fd4661a0967eb22e99e?pipeline=null
Can you refer me to the proper documentation, please?
I asked for the full logs. Is it possible to get them?
Please don't post unformatted code, logs, or configuration, as they're very hard to read.
Instead, paste the text and format it with the </> icon or pairs of triple backticks (```), and check the preview window to make sure it's properly formatted before posting. This makes it more likely that your question will receive a useful answer.
Hi David,
I have attached the FSCrawler trace output for a run over one PDF and one TIF file.
The content for both files comes back empty.
Thank you in advance,
Avishai -
13:38:04,484 INFO [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [1.8gb/26.6gb=6.85%], RAM [173.1gb/195.9gb=88.33%], Swap [196gb/223.9gb=87.51%].
13:38:04,500 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
13:38:04,500 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
13:38:04,500 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings.json] already exists
13:38:04,500 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings_folder.json] already exists
13:38:04,500 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [test0518]...
13:38:04,844 TRACE [f.p.e.c.f.c.FsCrawlerCli] settings used for this crawler: [---
name: "test0518"
fs:
url: "\\tmp\\stg"
update_rate: "15m"
excludes:
- "*/~*"
json_support: false
filename_as_id: false
add_filesize: true
remove_deleted: true
add_as_inner_object: false
store_source: false
index_content: true
attributes_support: false
raw_metadata: false
xml_support: false
index_folders: true
lang_detect: false
continue_on_error: false
ocr:
language: "heb"
path: "/Program Files/Tesseract-OCR"
data_path: "/Program Files/Tesseract-OCR/tessdata"
enabled: true
pdf_strategy: "ocr_and_text"
follow_symlinks: false
elasticsearch:
nodes:
- url: "http://127.0.0.1:9200"
bulk_size: 100
flush_interval: "5s"
byte_size: "10mb"
]
13:38:05,797 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped for file [\tmp\stg\00102286.TIF] on [windows server 2016]
13:38:05,797 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped for file [\tmp\stg\00102286.TIF] on [windows server 2016]
13:38:05,797 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped for file [\tmp\stg\15857372.PDF] on [windows server 2016]
13:38:05,797 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped for file [\tmp\stg\15857372.PDF] on [windows server 2016]
13:38:05,797 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 2 local files found
13:38:05,797 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstractModel{name='00102286.TIF', file=true, directory=false, lastModifiedDate=2004-01-13T16:55:24, creationDate=2020-05-17T15:18:06.590180, accessDate=2020-05-17T15:18:06.590180, path='\tmp\stg', owner='Me\***', group='null', permissions=-1, extension='tif', fullpath='C:\tmp\stg\00102286.TIF', size=220752}
13:38:05,797 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg\00102286.TIF) = /00102286.TIF
13:38:05,797 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/00102286.TIF], includes = [null], excludes = [[*/~*]]
13:38:05,797 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/00102286.TIF], excludes = [[*/~*]]
13:38:05,797 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
13:38:05,797 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
13:38:05,797 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/00102286.TIF], includes = [null]
13:38:05,797 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
13:38:05,797 DEBUG [f.p.e.c.f.FsParserAbstract] [/00102286.TIF] can be indexed: [true]
13:38:05,813 DEBUG [f.p.e.c.f.FsParserAbstract] - file: /00102286.TIF
13:38:05,813 DEBUG [f.p.e.c.f.FsParserAbstract] - not modified: creation date 2020-05-17T15:18:06.590180 , file date 2004-01-13T16:55:24, last scan date 2020-05-18T13:36:02.345
13:38:05,813 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstractModel{name='15857372.PDF', file=true, directory=false, lastModifiedDate=2019-01-23T19:32:37.677090, creationDate=2020-05-17T15:55:14.183595, accessDate=2020-05-17T15:55:14.183595, path='\tmp\stg', owner='Me\***', group='null', permissions=-1, extension='pdf', fullpath='C:\tmp\stg\15857372.PDF', size=608949}
13:38:05,813 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg\15857372.PDF) = /15857372.PDF
13:38:05,813 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/15857372.PDF], includes = [null], excludes = [[*/~*]]
13:38:05,813 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/15857372.PDF], excludes = [[*/~*]]
13:38:05,813 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
13:38:05,813 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
13:38:05,813 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/15857372.PDF], includes = [null]
13:38:05,813 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
13:38:05,813 DEBUG [f.p.e.c.f.FsParserAbstract] [/15857372.PDF] can be indexed: [true]
13:38:05,813 DEBUG [f.p.e.c.f.FsParserAbstract] - file: /15857372.PDF
13:38:05,829 DEBUG [f.p.e.c.f.FsParserAbstract] - not modified: creation date 2020-05-17T15:55:14.183595 , file date 2019-01-23T19:32:37.677090, last scan date 2020-05-18T13:36:02.345
13:38:05,829 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [\tmp\stg]...
13:38:05,829 TRACE [f.p.e.c.f.FsParserAbstract] Querying elasticsearch for files in dir [path.root:6c7bd4f3b29617bb2da3d3ffdbdaf7]
13:38:05,876 TRACE [f.p.e.c.f.FsParserAbstract] Response [fr.pilato.elasticsearch.crawler.fs.client.ESSearchResponse@5948b091]
13:38:05,876 TRACE [f.p.e.c.f.FsParserAbstract] We found: [00102286.TIF, 15857372.PDF]
13:38:05,876 TRACE [f.p.e.c.f.FsParserAbstract] Checking file [00102286.TIF]
13:38:05,876 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg\00102286.TIF) = /00102286.TIF
13:38:05,891 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/00102286.TIF], includes = [null], excludes = [[*/~*]]
13:38:05,891 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/00102286.TIF], excludes = [[*/~*]]
13:38:05,891 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
13:38:05,891 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
13:38:05,891 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/00102286.TIF], includes = [null]
13:38:05,891 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
13:38:05,891 TRACE [f.p.e.c.f.FsParserAbstract] Checking file [15857372.PDF]
13:38:05,891 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg\15857372.PDF) = /15857372.PDF
13:38:05,891 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/15857372.PDF], includes = [null], excludes = [[*/~*]]
13:38:05,891 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/15857372.PDF], excludes = [[*/~*]]
13:38:05,891 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
13:38:05,891 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
13:38:05,891 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/15857372.PDF], includes = [null]
13:38:05,891 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
13:38:05,891 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed directories in [\tmp\stg]...
13:38:05,907 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler is going to sleep for 15m
13:53:05,967 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler is now waking up again...
13:53:05,967 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler thread [test0518] is now running. Run #2...
13:53:05,967 DEBUG [f.p.e.c.f.FsParserAbstract] indexing [\tmp\stg] content
13:53:05,967 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from \tmp\stg
13:53:05,967 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped for file [\tmp\stg\00102286.TIF] on [windows server 2016]
13:53:05,967 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped for file [\tmp\stg\00102286.TIF] on [windows server 2016]
13:53:05,967 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped for file [\tmp\stg\15857372.PDF] on [windows server 2016]
13:53:05,967 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped for file [\tmp\stg\15857372.PDF] on [windows server 2016]
13:53:05,967 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 2 local files found
13:53:05,967 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstractModel{name='00102286.TIF', file=true, directory=false, lastModifiedDate=2004-01-13T16:55:24, creationDate=2020-05-17T15:18:06.590180, accessDate=2020-05-17T15:18:06.590180, path='\tmp\stg', owner='Me\***', group='null', permissions=-1, extension='tif', fullpath='C:\tmp\stg\00102286.TIF', size=220752}
13:53:05,983 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg\00102286.TIF) = /00102286.TIF
13:53:05,983 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/00102286.TIF], includes = [null], excludes = [[*/~*]]
13:53:05,983 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/00102286.TIF], excludes = [[*/~*]]
13:53:05,983 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
13:53:05,983 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
13:53:05,983 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/00102286.TIF], includes = [null]
13:53:05,983 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
13:53:05,983 DEBUG [f.p.e.c.f.FsParserAbstract] [/00102286.TIF] can be indexed: [true]
13:53:05,983 DEBUG [f.p.e.c.f.FsParserAbstract] - file: /00102286.TIF
13:53:05,983 DEBUG [f.p.e.c.f.FsParserAbstract] - not modified: creation date 2020-05-17T15:18:06.590180 , file date 2004-01-13T16:55:24, last scan date 2020-05-18T13:38:03.782
13:53:05,983 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstractModel{name='15857372.PDF', file=true, directory=false, lastModifiedDate=2019-01-23T19:32:37.677090, creationDate=2020-05-17T15:55:14.183595, accessDate=2020-05-17T15:55:14.183595, path='\tmp\stg', owner='Me\***', group='null', permissions=-1, extension='pdf', fullpath='C:\tmp\stg\15857372.PDF', size=608949}
13:53:05,983 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg\15857372.PDF) = /15857372.PDF
13:53:05,983 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/15857372.PDF], includes = [null], excludes = [[*/~*]]
13:53:05,983 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/15857372.PDF], excludes = [[*/~*]]
13:53:05,983 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
13:53:05,983 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
13:53:05,983 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/15857372.PDF], includes = [null]
13:53:05,983 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
13:53:05,999 DEBUG [f.p.e.c.f.FsParserAbstract] [/15857372.PDF] can be indexed: [true]
13:53:05,999 DEBUG [f.p.e.c.f.FsParserAbstract] - file: /15857372.PDF
13:53:05,999 DEBUG [f.p.e.c.f.FsParserAbstract] - not modified: creation date 2020-05-17T15:55:14.183595 , file date 2019-01-23T19:32:37.677090, last scan date 2020-05-18T13:38:03.782
13:53:05,999 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [\tmp\stg]...
13:53:05,999 TRACE [f.p.e.c.f.FsParserAbstract] Querying elasticsearch for files in dir [path.root:6c7bd4f3b29617bb2da3d3ffdbdaf7]
13:53:05,999 TRACE [f.p.e.c.f.FsParserAbstract] Response [fr.pilato.elasticsearch.crawler.fs.client.ESSearchResponse@37fe5ddb]
13:53:05,999 TRACE [f.p.e.c.f.FsParserAbstract] We found: [00102286.TIF, 15857372.PDF]
13:53:05,999 TRACE [f.p.e.c.f.FsParserAbstract] Checking file [00102286.TIF]
13:53:05,999 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg\00102286.TIF) = /00102286.TIF
13:53:05,999 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/00102286.TIF], includes = [null], excludes = [[*/~*]]
13:53:05,999 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/00102286.TIF], excludes = [[*/~*]]
13:53:05,999 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
13:53:06,014 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
13:53:06,014 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/00102286.TIF], includes = [null]
13:53:06,014 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
13:53:06,014 TRACE [f.p.e.c.f.FsParserAbstract] Checking file [15857372.PDF]
13:53:06,014 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg\15857372.PDF) = /15857372.PDF
13:53:06,014 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/15857372.PDF], includes = [null], excludes = [[*/~*]]
13:53:06,014 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/15857372.PDF], excludes = [[*/~*]]
13:53:06,014 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
13:53:06,014 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
13:53:06,014 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/15857372.PDF], includes = [null]
13:53:06,014 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
13:53:06,014 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed directories in [\tmp\stg]...
13:53:06,030 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler is going to sleep for 15m
Could you do the same thing but with the --restart option, since FSCrawler checked the dates here and did not find any new files?
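In other words, something along these lines (same sketch as before; --restart clears the stored last-run status so every file is picked up again on the next pass):
C:\fscrawler\bin>fscrawler.bat test0518 --trace --loop 1 --restart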
It helped with the TIF file but not with the PDF one.
Here is the output -
15:04:38,969 INFO [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [1.8gb/26.6gb=6.85%], RAM [173.1gb/195.9gb=88.32%], Swap [196gb/223.9gb=87.5%].
15:04:38,985 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
15:04:38,986 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
15:04:38,986 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings.json] already exists
15:04:38,986 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings_folder.json] already exists
15:04:38,986 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Cleaning existing status for job [test0518]...
15:04:38,986 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [test0518]...
15:04:39,328 TRACE [f.p.e.c.f.c.FsCrawlerCli] settings used for this crawler: [---
name: "test0518"
fs:
url: "\\tmp\\stg"
update_rate: "15m"
excludes:
- "*/~*"
json_support: false
filename_as_id: false
add_filesize: true
remove_deleted: true
add_as_inner_object: false
store_source: false
index_content: true
attributes_support: false
raw_metadata: false
xml_support: false
index_folders: true
lang_detect: false
continue_on_error: false
ocr:
language: "heb"
path: "/Program Files/Tesseract-OCR"
data_path: "/Program Files/Tesseract-OCR/tessdata"
enabled: true
pdf_strategy: "ocr_and_text"
follow_symlinks: false
elasticsearch:
nodes:
- url: "http://127.0.0.1:9200"
bulk_size: 100
flush_interval: "5s"
byte_size: "10mb"
]
15:04:40,235 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] create index [test0518_folder]
15:04:40,235 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] index settings: [{
"settings": {
"analysis": {
"analyzer": {
"fscrawler_path": {
"tokenizer": "fscrawler_path"
}
},
"tokenizer": {
"fscrawler_path": {
"type": "path_hierarchy"
}
}
}
},
"mappings": {
"properties" : {
"real" : {
"type" : "keyword",
"store" : true
},
"root" : {
"type" : "keyword",
"store" : true
},
"virtual" : {
"type" : "keyword",
"store" : true
}
}
}
}
]
15:04:40,235 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] wait for yellow health on index [test0518_folder]
15:04:40,250 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] health response: {"cluster_name":"Me","status":"yellow","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":1,"active_shards":1,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":1,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":54.166666666666664}
15:04:40,250 DEBUG [f.p.e.c.f.FsParserAbstract] creating fs crawler thread [test0518] for [\tmp\stg] every [15m]
15:04:40,250 INFO [f.p.e.c.f.FsParserAbstract] FS crawler started for [test0518] for [\tmp\stg] every [15m]
15:04:40,250 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler thread [test0518] is now running. Run #1...
15:04:40,266 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg) = /
15:04:40,266 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing test0518_folder/6c7bd4f3b29617bb2da3d3ffdbdaf7?pipeline=null
15:04:40,266 TRACE [f.p.e.c.f.FsParserAbstract] JSon indexed : {
"root" : "a1ba3d554a8a89c16d758b29eaff9953",
"virtual" : "/",
"real" : "\\tmp\\stg"
}
15:04:40,266 DEBUG [f.p.e.c.f.FsParserAbstract] indexing [\tmp\stg] content
15:04:40,266 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from \tmp\stg
15:04:40,281 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped for file [\tmp\stg\00102286.TIF] on [windows server 2016]
15:04:40,281 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped for file [\tmp\stg\00102286.TIF] on [windows server 2016]
15:04:40,281 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped for file [\tmp\stg\15857372.PDF] on [windows server 2016]
15:04:40,281 TRACE [f.p.e.c.f.f.FsCrawlerUtil] Determining 'group' is skipped for file [\tmp\stg\15857372.PDF] on [windows server 2016]
15:04:40,281 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 2 local files found
15:04:40,281 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstractModel{name='00102286.TIF', file=true, directory=false, lastModifiedDate=2004-01-13T16:55:24, creationDate=2020-05-17T15:18:06.590180, accessDate=2020-05-17T15:18:06.590180, path='\tmp\stg', owner='Me\***', group='null', permissions=-1, extension='tif', fullpath='C:\tmp\stg\00102286.TIF', size=220752}
15:04:40,281 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg\00102286.TIF) = /00102286.TIF
15:04:40,281 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/00102286.TIF], includes = [null], excludes = [[*/~*]]
15:04:40,281 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/00102286.TIF], excludes = [[*/~*]]
15:04:40,281 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
15:04:40,281 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
15:04:40,281 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/00102286.TIF], includes = [null]
15:04:40,281 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
15:04:40,281 DEBUG [f.p.e.c.f.FsParserAbstract] [/00102286.TIF] can be indexed: [true]
15:04:40,281 DEBUG [f.p.e.c.f.FsParserAbstract] - file: /00102286.TIF
15:04:40,297 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [\tmp\stg],[00102286.TIF]
15:04:40,303 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg\00102286.TIF) = /00102286.TIF
15:04:40,305 TRACE [f.p.e.c.f.t.TikaDocParser] Generating document [\tmp\stg\00102286.TIF]
15:04:40,312 TRACE [f.p.e.c.f.t.TikaDocParser] Beginning Tika extraction
15:04:40,312 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.
15:04:40,328 DEBUG [f.p.e.c.f.t.TikaInstance] But Tesseract is not installed so we won't run OCR.
15:04:40,575 WARN [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
15:04:40,859 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated so we need to configure Tesseract in case we have specific settings.
15:04:40,859 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Path set to [/Program Files/Tesseract-OCR].
15:04:40,859 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Data Path set to [/Program Files/Tesseract-OCR/tessdata].
15:04:40,859 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [heb].
15:04:45,031 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] Sending a bulk request of [1] requests
15:04:45,047 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] Executed bulk request with [1] requests
15:04:59,687 TRACE [f.p.e.c.f.t.TikaDocParser] End of Tika extraction
15:04:59,708 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
15:04:59,708 TRACE [f.p.e.c.f.f.FsCrawlerUtil] No pattern always matches.
15:04:59,721 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing test0518/7d7e5f4becfde4f8741314423b05667?pipeline=null
15:04:59,721 TRACE [f.p.e.c.f.FsParserAbstract] JSon indexed : {
"content" : "╫ס\n\n8...",
"meta" : { },
"file" : {
"extension" : "tif",
"content_type" : "image/tiff",
"created" : "2020-05-17T12:18:06.590+0000",
"last_modified" : "2004-01-13T14:55:24.000+0000",
"last_accessed" : "2020-05-17T12:18:06.590+0000",
"indexing_date" : "2020-05-18T12:04:40.303+0000",
"filesize" : 220752,
"filename" : "00102286.TIF",
"url" : "file://\\tmp\\stg\\00102286.TIF"
},
"path" : {
"root" : "6c7bd4f3b29617bb2da3d3ffdbdaf7",
"virtual" : "/00102286.TIF",
"real" : "\\tmp\\stg\\00102286.TIF"
}
}
15:04:59,728 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstractModel{name='15857372.PDF', file=true, directory=false, lastModifiedDate=2019-01-23T19:32:37.677090, creationDate=2020-05-17T15:55:14.183595, accessDate=2020-05-17T15:55:14.183595, path='\tmp\stg', owner='Me\***', group='null', permissions=-1, extension='pdf', fullpath='C:\tmp\stg\15857372.PDF', size=608949}
15:04:59,729 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg\15857372.PDF) = /15857372.PDF
15:04:59,730 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/15857372.PDF], includes = [null], excludes = [[*/~*]]
15:04:59,731 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/15857372.PDF], excludes = [[*/~*]]
15:04:59,732 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
15:04:59,733 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
15:04:59,749 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/15857372.PDF], includes = [null]
15:04:59,749 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
15:04:59,749 DEBUG [f.p.e.c.f.FsParserAbstract] [/15857372.PDF] can be indexed: [true]
15:04:59,749 DEBUG [f.p.e.c.f.FsParserAbstract] - file: /15857372.PDF
15:04:59,749 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [\tmp\stg],[15857372.PDF]
15:04:59,749 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg\15857372.PDF) = /15857372.PDF
15:04:59,749 TRACE [f.p.e.c.f.t.TikaDocParser] Generating document [\tmp\stg\15857372.PDF]
15:04:59,749 TRACE [f.p.e.c.f.t.TikaDocParser] Beginning Tika extraction
15:05:00,015 TRACE [f.p.e.c.f.t.TikaDocParser] End of Tika extraction
15:05:00,015 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
15:05:00,015 TRACE [f.p.e.c.f.f.FsCrawlerUtil] No pattern always matches.
15:05:00,015 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing test0518/b2fed9ec73554588e881dfa47e1404c?pipeline=null
15:05:00,015 TRACE [f.p.e.c.f.FsParserAbstract] JSon indexed : {
"content" : "\n\n\n\n",
"meta" : {
"format" : "application/pdf; version=1.3"
},
"file" : {
"extension" : "pdf",
"content_type" : "application/pdf",
"created" : "2020-05-17T12:55:14.183+0000",
"last_modified" : "2019-01-23T17:32:37.677+0000",
"last_accessed" : "2020-05-17T12:55:14.183+0000",
"indexing_date" : "2020-05-18T12:04:59.749+0000",
"filesize" : 608949,
"filename" : "15857372.PDF",
"url" : "file://\\tmp\\stg\\15857372.PDF"
},
"path" : {
"root" : "6c7bd4f3b29617bb2da3d3ffdbdaf7",
"virtual" : "/15857372.PDF",
"real" : "\\tmp\\stg\\15857372.PDF"
}
}
15:05:00,015 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [\tmp\stg]...
15:05:00,015 TRACE [f.p.e.c.f.FsParserAbstract] Querying elasticsearch for files in dir [path.root:6c7bd4f3b29617bb2da3d3ffdbdaf7]
15:05:00,062 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] Sending a bulk request of [2] requests
15:05:00,078 TRACE [f.p.e.c.f.FsParserAbstract] Response [fr.pilato.elasticsearch.crawler.fs.client.ESSearchResponse@8dbe287]
15:05:00,078 TRACE [f.p.e.c.f.FsParserAbstract] We found: [00102286.TIF, 15857372.PDF]
15:05:00,078 TRACE [f.p.e.c.f.FsParserAbstract] Checking file [00102286.TIF]
15:05:00,078 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg\00102286.TIF) = /00102286.TIF
15:05:00,078 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/00102286.TIF], includes = [null], excludes = [[*/~*]]
15:05:00,078 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/00102286.TIF], excludes = [[*/~*]]
15:05:00,078 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
15:05:00,078 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
15:05:00,078 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/00102286.TIF], includes = [null]
15:05:00,078 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] Executed bulk request with [2] requests
15:05:00,078 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
15:05:00,093 TRACE [f.p.e.c.f.FsParserAbstract] Checking file [15857372.PDF]
15:05:00,093 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\stg, \tmp\stg\15857372.PDF) = /15857372.PDF
15:05:00,093 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/15857372.PDF], includes = [null], excludes = [[*/~*]]
15:05:00,093 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/15857372.PDF], excludes = [[*/~*]]
15:05:00,093 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
15:05:00,093 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
15:05:00,093 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/15857372.PDF], includes = [null]
15:05:00,093 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
15:05:00,093 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed directories in [\tmp\stg]...
15:05:00,812 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler is going to sleep for 15m
I'm confused by this message:
15:04:40,328 DEBUG [f.p.e.c.f.t.TikaInstance] But Tesseract is not installed so we won't run OCR.
I think that this might be incorrect:
path: "/Program Files/Tesseract-OCR"
I think it should be something like:
path: "/Program Files/Tesseract-OCR/tesseract.exe"
That didn't work; I'm still getting "But Tesseract is not installed so we won't run OCR".
Any other idea why?
When I try executing Tesseract by itself (not through FSCrawler) I get the error below.
Perhaps Tesseract doesn't support OCR on PDFs?
C:\Tesseract-OCR>tesseract 15857372.PDF out
Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
Error in pixReadStream: Pdf reading is not supported
Error in pixRead: pix not read
Error during processing.
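As a sanity check it may be worth pointing Tesseract at the TIF instead (paths are illustrative; -l heb matches the language configured in the job), since Leptonica reads images such as TIFF but not PDFs:
C:\Tesseract-OCR>tesseract C:\tmp\stg\00102286.TIF out -l heb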
Maybe try with a directory name without a space in it?
Is FSCrawler running on the same drive?
Yes, it is on the same drive, and I also tried it outside the Program Files folder - it still didn't work.
I guess Tesseract doesn't support PDFs (see the error I attached in my previous post).
https://coptr.digipres.org/Tesseract-ocr
"
Any image readable by Leptonica is supported in Tesseract including BMP, PNM, PNG, JFIF, JPEG, and TIFF
"
Tesseract does not support PDF. But Tika actually extracts the images from the PDF and sends them to Tesseract.
It works on my laptop at least.
Could you share with me a PDF file so I can test extraction locally?
Interesting... so it should work.
Unfortunately those documents are confidential, so I can't share them - is there any other information I can pass along?
Could you try with this document?
This one is good - the text is indexed.
My other PDFs still aren't, though.
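For reference, one way to check whether the content field actually got populated for a given file is a query along these lines (a sketch; curl is just one possible client, and the index name test0518 comes from the job settings shown earlier):
curl "http://127.0.0.1:9200/test0518/_search?pretty&q=file.filename:15857372.PDF&_source=file.filename,content"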
I can't help more without a concrete example. If you could find a similar document which is not classified and share it, that would help.
At least we can see that OCR seems to be well configured.
There is no way to attach a PDF here.
I have an example; how can I attach it? (Authorized extensions: jpg, jpeg, png, gif.)
Use another binary upload site of your choice, or Dropbox, Box, Google Drive...
There it is -
Hi David,
Any luck?