Thank you for your answer. The content of the txt file looks like this (it is in Russian):
1. спирт
2. Лейкопластырь
3. Марлевые бинты
перчатки мед, ножницы острые купить
The ES document is (without the content field):
"_index" : "job_name",
"_type" : "_doc",
"_id" : "97a3acd5b3addf1ee3557eed47dafa6",
"_score" : 0.9614111,
"_source" : {
"meta" : { },
"file" : {
"extension" : "txt",
"created" : "2021-07-25T10:07:56.836+00:00",
"last_modified" : "2021-07-25T10:08:01.100+00:00",
"last_accessed" : "2021-07-25T10:07:56.836+00:00",
"indexing_date" : "2021-07-25T10:08:12.208+00:00",
"filesize" : 36,
"filename" : "file.txt",
"url" : """file://\tmp\es\file.txt"""
},
"path" : {
"root" : "3390d1be31e78ad623165b095e7dc7",
"virtual" : "/file.txt",
"real" : """\tmp\es\file.txt"""
}
}
},
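To confirm that the extracted text is really absent from a hit, the document can be checked for the `content` field. This is a minimal sketch in Python, assuming FSCrawler stores extracted text under `_source.content`. `has_content` is a hypothetical helper, and the sample hit is a cut-down copy of the document above.

```python
def has_content(hit: dict) -> bool:
    """Return True if the hit's _source carries a non-empty 'content' string."""
    content = hit.get("_source", {}).get("content")
    return isinstance(content, str) and content.strip() != ""


# Cut-down copy of the document shown above (it has no "content" key).
hit = {
    "_index": "job_name",
    "_id": "97a3acd5b3addf1ee3557eed47dafa6",
    "_source": {"file": {"extension": "txt", "filename": "file.txt", "filesize": 36}},
}

print(has_content(hit))
```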
Log output for this text file:
13:18:26,171 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\es, \tmp\es\рюкзак.txt) = /рюкзак.txt
13:18:26,172 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/рюкзак.txt], includes = [null], excludes = [[*/~*]]
13:18:26,172 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/рюкзак.txt], excludes = [[*/~*]]
13:18:26,172 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/рюкзак.txt], includes = [null]
13:18:26,172 DEBUG [f.p.e.c.f.FsParserAbstract] [/рюкзак.txt] can be indexed: [true]
13:18:26,172 DEBUG [f.p.e.c.f.FsParserAbstract] - file: /рюкзак.txt
13:18:26,173 DEBUG [f.p.e.c.f.FsParserAbstract] **fetching content** from [\tmp\es],[рюкзак.txt]
13:18:26,176 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\es, \tmp\es\рюкзак.txt) = /рюкзак.txt
The log says it is fetching content...
About images: I have installed Tesseract on the Windows PC where FSCrawler is installed. Should I also install it on the CentOS server that runs ES and configure it in some way?
I installed Tesseract on the Windows 7 PC, and the log file says:
13:18:26,208 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.
13:18:26,217 DEBUG [f.p.e.c.f.t.TikaInstance] But Tesseract is not installed so we won't run OCR.
13:18:26,750 WARN [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
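The "Tesseract is not installed" line suggests the executable was not found by the FSCrawler process. Below is a quick check that could be run on the same machine, sketched in Python under the assumption that the binary is named `tesseract` and is expected on PATH. `find_tesseract` is a hypothetical helper, not part of FSCrawler or Tika.

```python
import shutil
import subprocess


def find_tesseract(name="tesseract"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


path = find_tesseract()
if path is None:
    print("tesseract not found on PATH")
else:
    # Print the version banner to confirm the binary actually runs.
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    print(path)
    print(result.stdout or result.stderr)
```

If the executable is not on PATH, the `fs.ocr.path` job setting may be the place to point FSCrawler at the install directory, as far as I can tell from the FSCrawler docs. I have not verified this on Windows 7.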