13:30:54,074 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] wait for yellow health on index [test2_folder]
13:30:54,077 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] health response: {"cluster_name":"elasticsearch","status":"yellow","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":1,"active_shards":1,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":1,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":61.111111111111114}
13:30:54,080 DEBUG [f.p.e.c.f.FsParserAbstract] creating fs crawler thread [test2] for [/home/local/es] every [2m]
13:30:54,080 INFO [f.p.e.c.f.FsParserAbstract] FS crawler started for [test2] for [/home/local/es] every [2m]
13:30:54,080 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler thread [test2] is now running. Run #1...
13:30:54,088 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/home/local/es, /home/local/es) = /
13:30:54,091 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing test2_folder/8a4fb678d07a821dbc3abeb4aeb337a?pipeline=null
13:30:54,091 TRACE [f.p.e.c.f.FsParserAbstract] JSon indexed : {
  "root" : "bd7d7b2d64521cab5e6db3ff54a0711c",
  "virtual" : "/",
  "real" : "/home/local/es"
}
13:30:54,094 DEBUG [f.p.e.c.f.FsParserAbstract] indexing [/home/local/es] content
13:30:54,094 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] Listing local files from /home/local/es
13:30:54,099 DEBUG [f.p.e.c.f.c.f.FileAbstractorFile] 3 local files found
13:30:54,100 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstractModel{name='test-ocr.pdf', file=true, directory=false, lastModifiedDate=2020-06-25T12:13:16.999936, creationDate=2020-06-25T12:13:16.999936, accessDate=2020-06-25T12:13:16.999936, path='/home/local/es', owner='', group='', permissions=777, extension='pdf', fullpath='/home/local/es/test-ocr.pdf', size=112983}
13:30:54,100 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/home/local/es, /home/local/es/test-ocr.pdf) = /test-ocr.pdf
13:30:54,100 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/test-ocr.pdf], includes = [null], excludes = [[*/~*]]
13:30:54,100 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/test-ocr.pdf], excludes = [[*/~*]]
13:30:54,100 TRACE [f.p.e.c.f.f.FsCrawlerUtil] regex is [.*?/~.*?]
13:30:54,101 TRACE [f.p.e.c.f.f.FsCrawlerUtil] does not match any exclude pattern
13:30:54,101 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/test-ocr.pdf], includes = [null]
13:30:54,101 TRACE [f.p.e.c.f.f.FsCrawlerUtil] no include rules
13:30:54,101 DEBUG [f.p.e.c.f.FsParserAbstract] [/test-ocr.pdf] can be indexed: [true]
13:30:54,101 DEBUG [f.p.e.c.f.FsParserAbstract] - file: /test-ocr.pdf
13:30:54,101 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [/home/local/es],[test-ocr.pdf]
13:30:54,104 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/home/local/es, /home/local/es/test-ocr.pdf) = /test-ocr.pdf
13:30:54,105 TRACE [f.p.e.c.f.t.TikaDocParser] Generating document [/home/local/es/test-ocr.pdf]
13:30:54,113 TRACE [f.p.e.c.f.t.TikaDocParser] Beginning Tika extraction
13:30:54,128 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.
13:30:54,164 DEBUG [f.p.e.c.f.t.TikaInstance] OCR strategy for PDF documents is [ocr_and_text] and tesseract was found.
13:30:54,472 WARN [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
13:30:54,755 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated so we need to configure Tesseract in case we have specific settings.
13:30:54,756 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Path set to [/home/local/bin].
13:30:54,756 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Data Path set to [/home/local/share/tessdata].
13:30:54,756 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [eng].
13:30:56,277 TRACE [f.p.e.c.f.t.TikaDocParser] End of Tika extraction
13:30:56,285 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
13:30:56,285 TRACE [f.p.e.c.f.f.FsCrawlerUtil] No pattern always matches.
13:30:56,298 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing test2/2e9faf9314bd6ea61ede8b7559f253b?pipeline=null
13:30:56,298 TRACE [f.p.e.c.f.FsParserAbstract] JSon indexed : {
  "content" : "\n \n\n \n\nThis file also contains text. \n\n \n\n\n\nThis second part of the text is in Page 2 \n\n \n\n\n",
  "meta" : {
    "author" : "David Pilato",
    "title" : "Test Tika title",
    "date" : "2019-03-02T11:42:36.000+00:00",
    "keywords" : [ "keyword1", " keyword2" ],
    "language" : "en-US",
    "format" : "application/pdf; version=1.7",
    "creator_tool" : "Microsoft Word",
    "description" : "Test Tika Object",
    "created" : "2019-03-02T11:42:36.000+00:00"
  },
  "file" : {
    "extension" : "pdf",
    "content_type" : "application/pdf",
    "created" : "2020-06-25T10:13:16.999+00:00",
    "last_modified" : "2020-06-25T10:13:16.999+00:00",
    "last_accessed" : "2020-06-25T10:13:16.999+00:00",
    "indexing_date" : "2020-06-25T11:30:54.103+00:00",
    "filesize" : 112983,
    "filename" : "test-ocr.pdf",
    "url" : "file:///home/local/es/test-ocr.pdf"
  },
  "path" : {
    "root" : "8a4fb678d07a821dbc3abeb4aeb337a",
    "virtual" : "/test-ocr.pdf",
    "real" : "/home/local/es/test-ocr.pdf"
  }
}
13:30:56,298 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstractModel{name='index.png', file=true, directory=false, lastModifiedDate=2020-06-25T12:31:29.664488, creationDate=2020-06-25T12:31:29.664488, accessDate=2020-06-25T12:31:29.664488, path='/home/local/es', owner='',