16:20:38,881 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated so we need to configure Tesseract in case we have specific settings.
16:20:38,881 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [eng].
16:20:43,279 TRACE [f.p.e.c.f.t.TikaDocParser] End of Tika extraction
16:20:44,265 TRACE [f.p.e.c.f.t.TikaDocParser] Main detected language: [en: HIGH (0.999994)]
16:20:47,553 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] Sending a bulk request of [1] requests
16:20:48,875 TRACE [f.p.e.c.f.t.TikaDocParser] Listing all available metadata:
16:20:48,876 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw.entrySet(), iterableWithSize(42));
16:20:48,877 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("date", "2020-05-22T14:27:56Z"));
16:20:48,877 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("pdf:unmappedUnicodeCharsPerPage", "0"));
16:20:48,878 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("pdf:PDFVersion", "1.7"));
16:20:48,878 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("pdf:docinfo:title", "tesseract_header.jpg"));
16:20:48,879 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("pdf:hasXFA", "false"));
16:20:48,880 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("access_permission:modify_annotations", "true"));
16:20:48,880 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("access_permission:can_print_degraded", "true"));
16:20:48,881 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("dc:creator", "z0045ucs"));
16:20:48,882 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("dcterms:created", "2020-05-22T14:27:56Z"));
16:20:48,882 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("Last-Modified", "2020-05-22T14:27:56Z"));
16:20:48,883 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("dcterms:modified", "2020-05-22T14:27:56Z"));
16:20:48,884 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("dc:format", "application/pdf; version=1.7"));
16:20:48,884 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("title", "tesseract_header.jpg"));
16:20:48,885 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("Last-Save-Date", "2020-05-22T14:27:56Z"));
16:20:48,885 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("access_permission:fill_in_form", "true"));
16:20:48,886 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("pdf:docinfo:modified", "2020-05-22T14:27:56Z"));
16:20:48,887 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("meta:save-date", "2020-05-22T14:27:56Z"));
16:20:48,887 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("pdf:encrypted", "false"));
16:20:48,888 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("dc:title", "tesseract_header.jpg"));
16:20:48,888 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("modified", "2020-05-22T14:27:56Z"));
16:20:48,889 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("pdf:hasMarkedContent", "false"));
16:20:48,890 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("Content-Type", "application/pdf"));
16:20:48,890 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("pdf:docinfo:creator", "z0045ucs"));
16:20:48,891 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("X-Parsed-By", "org.apache.tika.parser.pdf.PDFParser"));
16:20:48,892 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("creator", "z0045ucs"));
16:20:48,893 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("meta:author", "z0045ucs"));
16:20:48,894 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("meta:creation-date", "2020-05-22T14:27:56Z"));
16:20:48,895 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("created", "2020-05-22T14:27:56Z"));
16:20:48,895 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("access_permission:extract_for_accessibility", "true"));
16:20:48,896 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("access_permission:assemble_document", "true"));
16:20:48,897 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("xmpTPg:NPages", "1"));
16:20:48,898 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("Creation-Date", "2020-05-22T14:27:56Z"));
16:20:48,899 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("resourceName", "noisy.pdf"));
16:20:48,900 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("pdf:hasXMP", "false"));
16:20:48,900 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("pdf:charsPerPage", "0"));
16:20:48,901 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("access_permission:extract_content", "true"));
16:20:48,902 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("access_permission:can_print", "true"));
16:20:48,903 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("Author", "z0045ucs"));
16:20:48,903 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("producer", "Microsoft: Print To PDF"));
16:20:48,904 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("access_permission:can_modify", "true"));
16:20:48,905 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("pdf:docinfo:producer", "Microsoft: Print To PDF"));
16:20:48,906 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("pdf:docinfo:created", "2020-05-22T14:27:56Z"));
16:20:48,907 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
16:20:48,908 TRACE [f.p.e.c.f.f.FsCrawlerUtil] No pattern always matches.
16:20:48,913 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] Executed bulk request with [1] requests
16:20:48,928 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing testing_tesseract_v5/noisy.pdf?pipeline=null
16:20:48,928 TRACE [f.p.e.c.f.FsParserAbstract] JSon indexed : {
  "content" : "\n \n\nCcv)malcolm:tesseract-python adrianrosebrock$ python ocr.py --image images/example_@1.png\nNoisy image\nto test\nOCS 1 1yeln edd\n\n \n \n \n\nNoisy image\nto test\nTesseract OCR\n\n \n \n\n\n",
  "meta" : {
    "author" : "z0045ucs",
    "title" : "tesseract_header.jpg",
    "date" : "2020-05-22T12:27:56.000+00:00",
    "language" : "en",
    "format" : "application/pdf; version=1.7",
    "created" : "2020-05-22T12:27:56.000+00:00",
    "raw" : {
      "date" : "2020-05-22T14:27:56Z",
      "pdf:unmappedUnicodeCharsPerPage" : "0",
      "pdf:PDFVersion" : "1.7",
      "pdf:docinfo:title" : "tesseract_header.jpg",
      "pdf:hasXFA" : "false",
      "access_permission:modify_annotations" : "true",
      "access_permission:can_print_degraded" : "true",
      "dc:creator" : "z0045ucs",
      "dcterms:created" : "2020-05-22T14:27:56Z",
      "Last-Modified" : "2020-05-22T14:27:56Z",
      "dcterms:modified" : "2020-05-22T14:27:56Z",
      "dc:format" : "application/pdf; version=1.7",
      "title" : "tesseract_header.jpg",
      "Last-Save-Date" : "2020-05-22T14:27:56Z",
      "access_permission:fill_in_form" : "true",
      "pdf:docinfo:modified" : "2020-05-22T14:27:56Z",
      "meta:save-date" : "2020-05-22T14:27:56Z",
      "pdf:encrypted" : "false",
      "dc:title" : "tesseract_header.jpg",
      "modified" : "2020-05-22T14:27:56Z",
      "pdf:hasMarkedContent" : "false",
      "Content-Type" : "application/pdf",
      "pdf:docinfo:creator" : "z0045ucs",
      "X-Parsed-By" : "org.apache.tika.parser.pdf.PDFParser",
      "creator" : "z0045ucs",
      "meta:author" : "z0045ucs",
      "meta:creation-date" : "2020-05-22T14:27:56Z",
      "created" : "2020-05-22T14:27:56Z",
      "access_permission:extract_for_accessibility" : "true",
      "access_permission:assemble_document" : "true",
      "xmpTPg:NPages" : "1",
      "Creation-Date" : "2020-05-22T14:27:56Z",
      "resourceName" : "noisy.pdf",
      "pdf:hasXMP" : "false",
      "pdf:charsPerPage" : "0",
      "access_permission:extract_content" : "true",
      "access_permission:can_print" : "true",
      "Author" : "z0045ucs",
      "producer" : "Microsoft: Print To PDF",
      "access_permission:can_modify" : "true",
      "pdf:docinfo:producer" : "Microsoft: Print To PDF",
      "pdf:docinfo:created" : "2020-05-22T14:27:56Z"
    }
  },
  "file" : {
    "extension" : "pdf",
    "content_type" : "application/pdf",
    "created" : "2020-05-28T14:12:41.781+00:00",
    "last_modified" : "2020-05-22T14:27:57.047+00:00",
    "last_accessed" : "2020-05-28T14:12:41.781+00:00",
    "indexing_date" : "2020-05-28T14:20:38.100+00:00",
    "filesize" : 43987,
    "filename" : "noisy.pdf",
    "url" : "file://C:\\Data_Privacy_GAT\\Testing_Tesseract_2\\noisy.pdf"
  },
  "path" : {
    "root" : "501a70282ead4e6535ce27023b95d",
    "virtual" : "/noisy.pdf",
    "real" : "C:\\Data_Privacy_GAT\\Testing_Tesseract_2\\noisy.pdf"
  }
}
16:20:48,949 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [C:\Data_Privacy_GAT\Testing_Tesseract_2]...
16:20:48,950 TRACE [f.p.e.c.f.FsParserAbstract] Querying elasticsearch for files in dir [path.root:501a70282ead4e6535ce27023b95d]
16:20:48,986 TRACE [f.p.e.c.f.FsParserAbstract] Response [fr.pilato.elasticsearch.crawler.fs.client.ESSearchResponse@592b622c]
16:20:48,987 TRACE [f.p.e.c.f.FsParserAbstract] We found: []
16:20:48,987 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed directories in [C:\Data_Privacy_GAT\Testing_Tesseract_2]...
16:20:49,001 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler is going to sleep for 15m
16:20:58,884 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] Sending a bulk request of [1] requests
16:20:58,942 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] Executed bulk request with [1] requests
Thanks for your help.