FSCrawler image file text extraction

Hello!

I use FSCrawler 2.7 on Windows 7 to ingest file content into an Elasticsearch 7.13 node on CentOS 7.

It works really well for .pdf and .docx files, but not for image files such as .jpg or .png, or for .txt files. My job settings file:

---
name: "job_name"
fs:
  url: "\\tmp\\es"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: false
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "ru"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://192.168.1.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
  ssl_verification: false

How do I configure the job to extract text from image files and .txt files?
Thanks.

For txt files, it should work out of the box. Do you have a txt file to share and what the document in elasticsearch looks like?
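One way to see what the indexed documents look like is to search for those that have no `content` field. This is only a minimal sketch: it assumes the index is named after the job (`job_name`) and that the node URL is the one from the job settings; the helper names are made up for illustration.

```python
import json
import urllib.request


def missing_content_query(size=10):
    """Build a search body matching documents that have no `content` field."""
    return {
        "size": size,
        "query": {"bool": {"must_not": [{"exists": {"field": "content"}}]}},
    }


def has_content(hit):
    """Return True if a search hit carries non-empty extracted text."""
    content = hit.get("_source", {}).get("content") or ""
    return bool(content.strip())


def search(base_url, index, body):
    """POST the search body to Elasticsearch and return the parsed JSON response."""
    request = urllib.request.Request(
        f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Assumed values: node URL and index name taken from the job settings.
    result = search("http://192.168.1.1:9200", "job_name", missing_content_query())
    for hit in result["hits"]["hits"]:
        print(hit["_source"]["file"]["filename"], has_content(hit))
```

Any file name printed here was indexed with metadata only, which helps tell an extraction problem apart from a mapping or display problem.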

For images, it mostly depends if you have installed Tesseract for ocr or not.

Thank you for your answer. The content of the txt file looks like this (Russian language):

1. спирт
2. Лейкопластырь
3. Марлевые бинты
перчатки мед, ножницы острые купить

The ES document is (without a content field):

        "_index" : "job_name",
        "_type" : "_doc",
        "_id" : "97a3acd5b3addf1ee3557eed47dafa6",
        "_score" : 0.9614111,
        "_source" : {
          "meta" : { },
          "file" : {
            "extension" : "txt",
            "created" : "2021-07-25T10:07:56.836+00:00",
            "last_modified" : "2021-07-25T10:08:01.100+00:00",
            "last_accessed" : "2021-07-25T10:07:56.836+00:00",
            "indexing_date" : "2021-07-25T10:08:12.208+00:00",
            "filesize" : 36,
            "filename" : "file.txt",
            "url" : """file://\tmp\es\file.txt"""
          },
          "path" : {
            "root" : "3390d1be31e78ad623165b095e7dc7",
            "virtual" : "/file.txt",
            "real" : """\tmp\es\file.txt"""
          }
        }
      },

Log information about this text file:

3:18:26,171 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\es, \tmp\es\рюкзак.txt) = /рюкзак.txt
13:18:26,172 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/рюкзак.txt], includes = [null], excludes = [[*/~*]]
13:18:26,172 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/рюкзак.txt], excludes = [[*/~*]]
13:18:26,172 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/рюкзак.txt], includes = [null]
13:18:26,172 DEBUG [f.p.e.c.f.FsParserAbstract] [/рюкзак.txt] can be indexed: [true]
13:18:26,172 DEBUG [f.p.e.c.f.FsParserAbstract]   - file: /рюкзак.txt
13:18:26,173 DEBUG [f.p.e.c.f.FsParserAbstract] **fetching content** from [\tmp\es],[рюкзак.txt]
13:18:26,176 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\es, \tmp\es\рюкзак.txt) = /рюкзак.txt

It says fetching content...

As for images: I have installed Tesseract on the Windows PC where FSCrawler is installed. Should I install it on the CentOS server with ES as well and configure it in some way?
I installed Tesseract on the Windows 7 PC, and the log file says:

13:18:26,208 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.
13:18:26,217 DEBUG [f.p.e.c.f.t.TikaInstance] But Tesseract is not installed so we won't run OCR.
13:18:26,750 WARN  [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Could you open an issue in FSCrawler and share your files there, so I can test locally?

Thanks, I opened a GitHub issue here.

But can you answer my question about image recognition and Tesseract from my previous message, please?

Tesseract should be available from the path. If not, you can configure the path manually. See the OCR integration page of the FSCrawler 2.7-SNAPSHOT documentation.
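A quick way to check whether `tesseract` can be found from the path is a small script run on the same machine as FSCrawler. This is just a sketch: the function name and the optional `extra_dir` argument are made up for illustration, and the directory used in the example call is an assumed install location.

```python
import os
import shutil


def find_tesseract(extra_dir=None):
    """Return the full path of the tesseract executable, or None if it is not found.

    extra_dir is an optional directory searched before the regular PATH,
    for example an install folder that was never added to PATH.
    """
    search_path = os.environ.get("PATH", "")
    if extra_dir:
        search_path = extra_dir + os.pathsep + search_path
    return shutil.which("tesseract", path=search_path)


if __name__ == "__main__":
    # Assumed install directory; adjust to the actual location.
    print("On PATH:", find_tesseract())
    print("With explicit directory:", find_tesseract("C:/Program Files/Tesseract-OCR"))
```

If the first line prints `None` while the second prints a path, the executable is installed but not on the path, which is the case where setting the path manually in the job settings applies.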

Here is the configuration that works for me (Tesseract is installed on Windows):

---
name: "second"
fs:
  url: "C:\\tmp\\es"
  update_rate: "2m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "rus"
    enabled: true
    path: "C:/Program Files/Tesseract-OCR"
    data_path: "C:/Program Files/Tesseract-OCR/tessdata" 
    pdf_strategy: "auto"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://192.168.1.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
  ssl_verification: false
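Compared with the first job file, this one sets `ocr.path` and `ocr.data_path` explicitly and uses the language code "rus" instead of "ru". To see which language codes a local Tesseract install actually offers, `tesseract --list-langs` can be wrapped in a small script. This is only a sketch: the function names are made up for illustration, and the header-line format it skips is an assumption about the command's output.

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Assumption: the header line starts with "List of available languages".
    return [line for line in lines if not line.lower().startswith("list of")]


def installed_languages(tesseract_cmd="tesseract"):
    """Run Tesseract and return the language codes it reports."""
    result = subprocess.run(
        [tesseract_cmd, "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Assumption: some builds print the list to stderr, so both streams are read.
    return parse_list_langs(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    langs = installed_languages()
    print(langs)
    print("rus installed:", "rus" in langs)
```

The value of `ocr.language` should be one of the codes in that list.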

Thanks!
