FSCrawler errors with large PDF files - Hard failure and timeout

Hi, folks!

I'm trying to index some large PDF files (~70MB) in Elasticsearch using FSCrawler, but not all of them get indexed, and exceptions like the following are thrown:

DEBUG [f.p.e.c.f.FsParserAbstract] [/04/dejt-jud_26-04-2019.pdf] can be indexed: [true]
DEBUG [f.p.e.c.f.FsParserAbstract]   - file: /04/dejt-jud_26-04-2019.pdf
DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [/home/erick/Elastic/DEJT/04],[dejt-jud_26-04-2019.pdf]
DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/home/erick/Elastic/DEJT, /home/erick/Elastic/DEJT/04/dejt-jud_26-04-2019.pdf) = /04/dejt-jud_26-04-2019.pdf
WARN  [f.p.e.c.f.c.v.ElasticsearchClientV7] Got a hard failure when executing the bulk request
java.net.SocketTimeoutException: 30.000 milliseconds timeout on connection http-outgoing-6 [ACTIVE]

My _settings.yaml file:

name: "desenv_dejt_jud"
fs:
  url: "/home/erick/Elastic/DEJT"
  update_rate: "30m"
  excludes:
  - ".*dejt-adm.*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  indexed_chars: "100%"
  ocr:
    language: "eng"
    enabled: false
    pdf_strategy: "ocr_and_text"
elasticsearch:
  nodes:
  - url: ""
  bulk_size: 1
  flush_interval: "60s"
  byte_size: "10m"
  pipeline: "dejt_jud_pipeline"

Without indexed_chars: "100%", it works fine.
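
To put numbers on that difference, here is a rough Python sketch of how the extraction limit changes with this setting. It assumes, based on my reading of the FSCrawler documentation, that a percentage value is applied to the file size and that the limit falls back to 100,000 characters when the setting is absent. The function name extraction_limit and the constant are hypothetical, not part of FSCrawler:

```python
# Sketch of how the extraction limit could be derived from indexed_chars.
# Assumptions (not taken from FSCrawler's source code): a percentage is
# applied to the file size in bytes, and the limit is 100,000 characters
# when the setting is absent.

DEFAULT_INDEXED_CHARS = 100_000  # assumed FSCrawler default


def extraction_limit(file_size_bytes, indexed_chars=None):
    """Return the approximate maximum number of characters to extract."""
    if indexed_chars is None:
        return DEFAULT_INDEXED_CHARS
    value = str(indexed_chars).strip()
    if value.endswith("%"):
        # Percentage of the file size, e.g. "100%" or "150%".
        return int(file_size_bytes * float(value[:-1]) / 100)
    # Otherwise treat the value as an absolute character count.
    return int(float(value))


pdf_size = 70 * 1024 * 1024  # ~70MB, like the PDF files in this folder

print(extraction_limit(pdf_size))          # setting absent
print(extraction_limit(pdf_size, "100%"))  # value from my _settings.yaml
```

If those assumptions hold, each ~70MB PDF can send up to roughly 73 million characters in a single bulk request instead of 100,000, which would fit with the timeouts only showing up when the setting is present.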

I'm running a single Elasticsearch node with Xms10g/Xmx10g, and starting fscrawler like this:

FS_JAVA_OPTS="-Xmx4g -Xms4g" ./fscrawler-es7-2.7-SNAPSHOT/bin/fscrawler desenv_dejt_jud --debug

If I index the files one at a time, the exception is sometimes thrown, but the document still gets indexed.
With all the documents in the folder, many "hard failure" / "30.000 milliseconds timeout" exceptions are thrown, and only a few documents get indexed.

I tried different values for bulk_size, byte_size and flush_interval, but the result is the same.

I also looked for a way to increase the Elasticsearch connection timeout, but found nothing about it.

Have any of you seen anything like this before?

Thank you!!

