FSCrawler: Error while crawling - Invalid UTF-8 start byte 0xb5

Yamini_Shashank · April 28, 2020, 8:00am

I've installed "fscrawler-es7-2.7-SNAPSHOT".
I've created a PDF from MS Word with simple text.
Here is a snapshot:

Now I've placed it in a folder structure such as:

My configuration in YAML:

---
name: "hvr"
fs:
  url: "F:\\root_folder_in_snapshot"
  update_rate: "1m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: true
  raw_metadata: false
  xml_support: true
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
    path: "F:\\Tesseract-OCR"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "myelasticurl"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"

But I end up with this error:

Any help will be appreciated.

Yamini_Shashank · April 28, 2020, 8:11am

Apologies, the issue was caused by this setting in my configuration:

"xml_support: true"
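
For anyone who hits the same error, here is a minimal sketch of the change, assuming the rest of the configuration above stays as it is. As far as I understand, with xml_support enabled FSCrawler tries to read the crawled files as XML, which would explain why a binary PDF trips the UTF-8 check, but treat that explanation as an assumption rather than a confirmed diagnosis.

```yaml
# Fragment of the job settings shown above; only the changed key is listed.
fs:
  # This was "true" in my original configuration, and the crawl failed with
  # "Invalid UTF-8 start byte 0xb5" on the PDF. Turning it off fixed that.
  xml_support: false
```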

dadoonet · April 28, 2020, 10:17am

Great that you found the problem and shared the solution. When we spoke on GitHub, I meant that you should open a new issue on GitHub, not specifically here, but that's fine as it's not lost anywhere.

For your next post, please don't post images of text as they are hard to read, may not display correctly for everyone, and are not searchable.

Yamini_Shashank · April 28, 2020, 10:27am

Actually, I seem to have bumped into another error:

10:23:48,194 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [418.8mb/7.1gb=5.75%], RAM [9.9gb/31.9gb=31.16%], Swap [7gb/39.9gb=17.64%].
10:23:48,529 INFO  [f.p.e.c.f.c.FsCrawlerCli] attributes_support is set to true but getting group is not available on [windows server 2016].
10:23:48,540 INFO  [f.p.e.c.f.FsCrawlerImpl] attributes_support is set to true but getting group is not available on [windows server 2016].
10:23:49,086 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.5.0
10:23:49,147 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
10:23:49,147 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
10:23:49,353 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [hvr] for [F:\hvr_copy] every [1m]
10:23:49,681 WARN  [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

10:24:50,586 WARN  [f.p.e.c.f.FsParserAbstract] Can't find stored field name to check existing filenames in path [F:\hvr_copy\00\2a]. Please set store: true on field [file.filename]
10:24:50,586 WARN  [f.p.e.c.f.FsParserAbstract] Error while crawling F:\hvr_copy: Mapping is incorrect: please set stored: true on field [file.filename].

Do I need to create an Elasticsearch schema mapping prior to crawling data into an index?

dadoonet · April 28, 2020, 10:47am

This probably means that the index was created before FSCrawler started, and with an incompatible mapping. If you don't want to do anything specific with the mapping and you don't care about the existing data, just delete the index:

DELETE hvr*

And restart FSCrawler:

bin/fscrawler hvr --restart

Yamini_Shashank · April 28, 2020, 11:56am

It worked! Thank you!

system · May 26, 2020, 11:56am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
