FSCrawler image file text extraction

Jenya87 · July 24, 2021, 6:12pm

Hello!

I use FSCrawler 2.7 on Windows 7 to ingest file contents into an Elasticsearch 7.13 node on CentOS 7.

It works really well for .pdf and .docx files, but not for image files such as .jpg or .png, and not for .txt files. My job settings file:

---
name: "job_name"
fs:
  url: "\\tmp\\es"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: false
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "ru"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://192.168.1.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
  ssl_verification: false

How do I configure the job to extract text from image files and .txt files?
Thanks.

dadoonet · July 24, 2021, 10:10pm

For txt files, it should work out of the box. Do you have a txt file to share, and can you show what the document in Elasticsearch looks like?

For images, it mostly depends on whether you have installed Tesseract for OCR.

Jenya87 · July 25, 2021, 10:25am

Thank you for your answer. The content of the txt file looks like this (in Russian):

1. спирт
2. Лейкопластырь
3. Марлевые бинты
перчатки мед, ножницы острые купить

The ES document is (without a content field):

        "_index" : "job_name",
        "_type" : "_doc",
        "_id" : "97a3acd5b3addf1ee3557eed47dafa6",
        "_score" : 0.9614111,
        "_source" : {
          "meta" : { },
          "file" : {
            "extension" : "txt",
            "created" : "2021-07-25T10:07:56.836+00:00",
            "last_modified" : "2021-07-25T10:08:01.100+00:00",
            "last_accessed" : "2021-07-25T10:07:56.836+00:00",
            "indexing_date" : "2021-07-25T10:08:12.208+00:00",
            "filesize" : 36,
            "filename" : "file.txt",
            "url" : """file://\tmp\es\file.txt"""
          },
          "path" : {
            "root" : "3390d1be31e78ad623165b095e7dc7",
            "virtual" : "/file.txt",
            "real" : """\tmp\es\file.txt"""
          }
        }
      },
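The missing text can also be spotted programmatically. The following is a minimal sketch, assuming that FSCrawler stores extracted text in a top-level `content` field of `_source` and that the hits have already been fetched from the index as Python dictionaries; `hits_without_content` is a hypothetical helper name, not part of FSCrawler or of any Elasticsearch client.

```python
from typing import Any, Dict, List


def hits_without_content(hits: List[Dict[str, Any]]) -> List[str]:
    """Return the filenames of hits whose _source has no non-empty `content`."""
    missing = []
    for hit in hits:
        source = hit.get("_source", {})
        content = source.get("content")
        # Treat an absent, non-string, or whitespace-only field as missing.
        if not isinstance(content, str) or not content.strip():
            missing.append(source.get("file", {}).get("filename", "<unknown>"))
    return missing


if __name__ == "__main__":
    # Trimmed copy of the hit shown above, plus a hypothetical healthy hit.
    sample_hits = [
        {"_source": {"meta": {}, "file": {"filename": "file.txt", "filesize": 36}}},
        {"_source": {"content": "some text", "file": {"filename": "ok.txt"}}},
    ]
    print(hits_without_content(sample_hits))  # prints ['file.txt']
```

Running this over the hits of a search response gives a quick list of the files that were indexed with metadata only.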

Log information about this text file:

13:18:26,171 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\es, \tmp\es\рюкзак.txt) = /рюкзак.txt
13:18:26,172 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] directory = [false], filename = [/рюкзак.txt], includes = [null], excludes = [[*/~*]]
13:18:26,172 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/рюкзак.txt], excludes = [[*/~*]]
13:18:26,172 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] filename = [/рюкзак.txt], includes = [null]
13:18:26,172 DEBUG [f.p.e.c.f.FsParserAbstract] [/рюкзак.txt] can be indexed: [true]
13:18:26,172 DEBUG [f.p.e.c.f.FsParserAbstract]   - file: /рюкзак.txt
13:18:26,173 DEBUG [f.p.e.c.f.FsParserAbstract] **fetching content** from [\tmp\es],[рюкзак.txt]
13:18:26,176 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(\tmp\es, \tmp\es\рюкзак.txt) = /рюкзак.txt

It says fetching content...

About images: I have installed Tesseract on the Windows PC where FSCrawler is installed. Should I also install it on the CentOS server that runs ES, and configure it in some way?
Although Tesseract is installed on the Windows 7 PC, the log file says:

13:18:26,208 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.
13:18:26,217 DEBUG [f.p.e.c.f.t.TikaInstance] But Tesseract is not installed so we won't run OCR.
13:18:26,750 WARN  [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

dadoonet · July 25, 2021, 10:38am

Could you open an issue in FSCrawler and share your files there, so I can test locally?

Jenya87 · July 25, 2021, 12:58pm

Thanks, I opened a GitHub issue here.

But could you please answer my question about image recognition and Tesseract from my previous message?

dadoonet · July 25, 2021, 2:26pm

Tesseract should be available on the PATH. If not, you can configure the path manually. See the "OCR integration" page of the FSCrawler 2.7-SNAPSHOT documentation.
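To check whether the executable can be resolved, you can ask the operating system where it finds it. This is only a sketch and not part of FSCrawler: `find_executable` is a hypothetical helper name, and the fallback directory is an assumption taken from the install location mentioned in this thread. Run it on the machine where FSCrawler runs.

```python
import shutil
from typing import Optional


def find_executable(name: str, extra_dir: Optional[str] = None) -> Optional[str]:
    """Return the full path of `name`, or None if it cannot be found.

    The regular PATH is searched first; `extra_dir` (optional) is tried
    only when that search fails.
    """
    found = shutil.which(name)
    if found is None and extra_dir is not None:
        found = shutil.which(name, path=extra_dir)
    return found


if __name__ == "__main__":
    # Assumption: the Windows install directory used in this thread.
    location = find_executable("tesseract", extra_dir="C:/Program Files/Tesseract-OCR")
    if location is None:
        print("tesseract not found: install it or set fs.ocr.path in the job settings")
    else:
        print(f"tesseract found at {location}")
```

If the executable is only found through the fallback directory, it is not on the PATH, and that directory is the value to put in the `path` setting under `ocr`.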

Jenya87 · July 25, 2021, 5:50pm

Here is the configuration that works for me (Tesseract is installed on Windows):

---
name: "second"
fs:
  url: "C:\\tmp\\es"
  update_rate: "2m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "rus"
    enabled: true
    path: "C:/Program Files/Tesseract-OCR"
    data_path: "C:/Program Files/Tesseract-OCR/tessdata" 
    pdf_strategy: "auto"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://192.168.1.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
  ssl_verification: false

Thanks!
