FSCrawler: Error while crawling - Invalid UTF-8 start byte 0xb5

Hi @dadoonet,

I've installed "fscrawler-es7-2.7-SNAPSHOT".
I've created a PDF from MS Word with simple text.
Here is a snapshot:
image

Now I've placed it in a folder structure such as:
image

My configuration in yaml:

---
name: "hvr"
fs:
  url: "F:\\root_folder_in_snapshot"
  update_rate: "1m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: true
  raw_metadata: false
  xml_support: true
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
    path: "F:\\Tesseract-OCR"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "myelasticurl"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"

But I end up with this error:

Any help will be appreciated.

Apologies, the issue was caused by this setting:

"xml_support: true"
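
For anyone hitting the same message: one plausible explanation, not confirmed in this thread, is that with xml_support enabled the file is handed to an XML parser that reads it as UTF-8 text. PDFs exported from MS Word commonly carry a binary marker line (`%µµµµ`, four 0xB5 bytes) right after the header. The Python sketch below uses an assumed header, since the actual file is not shown here, to illustrate why 0xB5 cannot start a UTF-8 character. The names `SAMPLE_HEADER`, `is_valid_utf8_start` and `first_invalid_offset` are made up for this illustration.

```python
# Assumed first bytes of a PDF exported from MS Word (hypothetical sample,
# not the file from this thread): header line, then a binary marker line.
SAMPLE_HEADER = b"%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n"


def is_valid_utf8_start(byte: int) -> bool:
    """Return True if `byte` may begin a UTF-8 encoded character."""
    if byte <= 0x7F:          # 0xxxxxxx: single-byte (ASCII) character
        return True
    if 0xC2 <= byte <= 0xF4:  # lead byte of a 2-, 3- or 4-byte sequence
        return True
    # 0x80-0xBF are continuation bytes; 0xC0, 0xC1 and 0xF5+ are never valid.
    return False


def first_invalid_offset(data: bytes):
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


if __name__ == "__main__":
    print(is_valid_utf8_start(0xB5))
    print(first_invalid_offset(SAMPLE_HEADER))
```

0xB5 is 10110101 in binary, which is the pattern of a continuation byte, so a UTF-8 reader rejects it at the start of a character. With xml_support set back to false, the PDF should be treated as a regular document again.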

Great that you found what the problem was and shared the solution. When we spoke on GitHub, I meant that you should open a new issue on GitHub, not specifically here, but that's fine as it's not lost anywhere. :wink:

For your next post, please don't post images of text as they are hard to read, may not display correctly for everyone, and are not searchable.


Actually, I seem to have run into another error:

10:23:48,194 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [418.8mb/7.1gb=5.75%], RAM [9.9gb/31.9gb=31.16%], Swap [7gb/39.9gb=17.64%].
10:23:48,529 INFO  [f.p.e.c.f.c.FsCrawlerCli] attributes_support is set to true but getting group is not available on [windows server 2016].
10:23:48,540 INFO  [f.p.e.c.f.FsCrawlerImpl] attributes_support is set to true but getting group is not available on [windows server 2016].
10:23:49,086 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.5.0
10:23:49,147 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
10:23:49,147 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
10:23:49,353 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [hvr] for [F:\hvr_copy] every [1m]
10:23:49,681 WARN  [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

10:24:50,586 WARN  [f.p.e.c.f.FsParserAbstract] Can't find stored field name to check existing filenames in path [F:\hvr_copy\00\2a]. Please set store: true on field [file.filename]
10:24:50,586 WARN  [f.p.e.c.f.FsParserAbstract] Error while crawling F:\hvr_copy: Mapping is incorrect: please set stored: true on field [file.filename].

Do I need to create an Elasticsearch mapping before crawling data into an index?

This probably means that the index was created before FSCrawler started, and with an incompatible mapping. If you don't want to do anything specific with the mapping and you don't care about the existing data, just delete the index:

DELETE hvr*

And restart FSCrawler:

bin/fscrawler hvr --restart
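
If you would rather check the mapping before deleting anything, the Python sketch below inspects a mapping response for the field named in the warning. It assumes the JSON shape returned by `GET hvr/_mapping` on Elasticsearch 7.x (no type level). The function `filename_is_stored` and both sample dictionaries are hypothetical and only illustrate the check.

```python
def filename_is_stored(mapping_response: dict, index: str) -> bool:
    """Return True if file.filename has "store": true in the given index mapping."""
    properties = (
        mapping_response.get(index, {}).get("mappings", {}).get("properties", {})
    )
    file_props = properties.get("file", {}).get("properties", {})
    return file_props.get("filename", {}).get("store", False) is True


# Hypothetical mapping of an index created before FSCrawler started:
# file.filename was mapped dynamically, so "store" is absent.
dynamic_mapping = {
    "hvr": {"mappings": {"properties": {
        "file": {"properties": {"filename": {"type": "text"}}}
    }}}
}

# Hypothetical mapping with the stored field that the warning asks for.
stored_mapping = {
    "hvr": {"mappings": {"properties": {
        "file": {"properties": {"filename": {"type": "keyword", "store": True}}}
    }}}
}

if __name__ == "__main__":
    print(filename_is_stored(dynamic_mapping, "hvr"))
    print(filename_is_stored(stored_mapping, "hvr"))
```

If the check fails for your index, deleting it and restarting as shown above is the simplest way to get a compatible mapping.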

It worked! Thank you!
