FSCrawler errors with large PDF files - Hard failure and timeout

Hi, folks!

I'm trying to index some large PDF files (~70MB) in Elasticsearch using FSCrawler, but not all of them get indexed, and exceptions like the following are thrown:

DEBUG [f.p.e.c.f.FsParserAbstract] [/04/dejt-jud_26-04-2019.pdf] can be indexed: [true]
DEBUG [f.p.e.c.f.FsParserAbstract]   - file: /04/dejt-jud_26-04-2019.pdf
DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [/home/erick/Elastic/DEJT/04],[dejt-jud_26-04-2019.pdf]
DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/home/erick/Elastic/DEJT, /home/erick/Elastic/DEJT/04/dejt-jud_26-04-2019.pdf) = /04/dejt-jud_26-04-2019.pdf
WARN  [f.p.e.c.f.c.v.ElasticsearchClientV7] Got a hard failure when executing the bulk request
java.net.SocketTimeoutException: 30.000 milliseconds timeout on connection http-outgoing-6 [ACTIVE]

My _settings.yaml file:

---
name: "desenv_dejt_jud"
fs:
  url: "/home/erick/Elastic/DEJT"
  update_rate: "30m"
  excludes:
  - ".*dejt-adm.*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  indexed_chars: "100%"
  ocr:
    language: "eng"
    enabled: false
    pdf_strategy: "ocr_and_text"
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"
  bulk_size: 1
  flush_interval: "60s"
  byte_size: "10m"
  pipeline: "dejt_jud_pipeline"

If I remove the indexed_chars: "100%" setting, everything works fine.

I'm running a single Elasticsearch node with -Xms10g/-Xmx10g, and I start FSCrawler like this:

FS_JAVA_OPTS="-Xmx4g -Xms4g" ./fscrawler-es7-2.7-SNAPSHOT/bin/fscrawler desenv_dejt_jud --debug

If I index the files one at a time, the exception is sometimes thrown, but the document still gets indexed.
With all the documents in the folder, many "hard failure" / "30.000 milliseconds timeout" exceptions are thrown, and only a few documents get indexed.
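
To narrow down which files are involved, a small helper like this one can list the largest PDFs under the crawled folder so they can be tested one at a time. It is only a sketch: the function name list_large_files and the 50 MB threshold are my own choices for illustration, not anything provided by FSCrawler.

```python
import os


def list_large_files(root, min_bytes=50 * 1024 * 1024, suffix=".pdf"):
    """Return (size_in_bytes, path) pairs for matching files, largest first.

    min_bytes and suffix are illustrative defaults, not FSCrawler settings.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size >= min_bytes:
                found.append((size, path))
    # Sorting the tuples in reverse puts the biggest files at the top.
    return sorted(found, reverse=True)


# Example usage against the folder from my _settings.yaml:
# for size, path in list_large_files("/home/erick/Elastic/DEJT"):
#     print(f"{size / 1024 / 1024:8.1f} MB  {path}")
```

The files at the top of that list are the ones I would copy into a separate folder and crawl alone.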

I tried different values for bulk_size, byte_size and flush_interval, but the result is the same.

I also looked for a way to increase the Elasticsearch connection timeout, but found nothing about it.
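
One way to check whether Elasticsearch itself needs more than 30 seconds for a big document is to bypass FSCrawler and send a large body through the same ingest pipeline with a longer client-side timeout. The sketch below uses only the Python standard library and the Elasticsearch `POST /<index>/_doc?pipeline=<name>` endpoint. The scratch index name timeout_test, the placeholder text, and the 300-second timeout are assumptions for illustration, and it assumes the pipeline reads the extracted text from a `content` field.

```python
import json
import time
import urllib.parse
import urllib.request


def build_index_request(base_url, index, pipeline, content):
    """Build a POST request that indexes one document through an ingest pipeline."""
    query = urllib.parse.urlencode({"pipeline": pipeline})
    url = f"{base_url.rstrip('/')}/{index}/_doc?{query}"
    body = json.dumps({"content": content}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def timed_index(request, timeout_seconds=300):
    """Send the request and return (HTTP status, elapsed seconds).

    timeout_seconds is the client-side socket timeout; 300 is an arbitrary
    value chosen to be well above the 30 seconds seen in the log.
    """
    start = time.monotonic()
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        status = response.status
    return status, time.monotonic() - start


# Example usage (needs the local node from my settings to be running):
# text = "lorem ipsum " * 1_000_000  # roughly 12 MB of placeholder text
# req = build_index_request(
#     "http://127.0.0.1:9200", "timeout_test", "dejt_jud_pipeline", text
# )
# print(timed_index(req))
```

If the elapsed time comes out well above 30 seconds, that would suggest the time is being spent on the Elasticsearch side (pipeline or indexing) rather than in FSCrawler's extraction, though this test does not reproduce the real extracted text of the PDFs.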

Have any of you seen anything like this before?

Thank you!!

--
Erick
