Hi, folks!
I'm trying to index some large PDF files (~70 MB) in Elasticsearch using FSCrawler, but not all of them get indexed, and exceptions like the following are thrown:
DEBUG [f.p.e.c.f.FsParserAbstract] [/04/dejt-jud_26-04-2019.pdf] can be indexed: [true]
DEBUG [f.p.e.c.f.FsParserAbstract] - file: /04/dejt-jud_26-04-2019.pdf
DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [/home/erick/Elastic/DEJT/04],[dejt-jud_26-04-2019.pdf]
DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/home/erick/Elastic/DEJT, /home/erick/Elastic/DEJT/04/dejt-jud_26-04-2019.pdf) = /04/dejt-jud_26-04-2019.pdf
WARN [f.p.e.c.f.c.v.ElasticsearchClientV7] Got a hard failure when executing the bulk request
java.net.SocketTimeoutException: 30.000 milliseconds timeout on connection http-outgoing-6 [ACTIVE]
My _settings.yaml file:
---
name: "desenv_dejt_jud"
fs:
  url: "/home/erick/Elastic/DEJT"
  update_rate: "30m"
  excludes:
  - ".*dejt-adm.*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  indexed_chars: "100%"
  ocr:
    language: "eng"
    enabled: false
    pdf_strategy: "ocr_and_text"
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"
  bulk_size: 1
  flush_interval: "60s"
  byte_size: "10m"
  pipeline: "dejt_jud_pipeline"
Without the indexed_chars: "100%" setting, it works fine.
I'm running a single Elasticsearch node with Xms10g/Xmx10g, and starting fscrawler like this:
FS_JAVA_OPTS="-Xmx4g -Xms4g" ./fscrawler-es7-2.7-SNAPSHOT/bin/fscrawler desenv_dejt_jud --debug
If I index the files one by one, the exception is sometimes thrown, but the document still gets indexed.
With all the documents in the folder, many "hard failure" exceptions with the 30.000 ms timeout are thrown, and only a few documents get indexed.
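To put numbers on "many" and "few", the debug log could be tallied with something like the sketch below. It is only an illustration: it assumes every log line has the same shape as the excerpt above, and the function name `tally` and the `sample` list are made up for this example; they are not part of FSCrawler.

```python
import re

# Patterns assume the log line shapes shown in the excerpt above.
CANDIDATE = re.compile(r"\[(?P<path>[^\]]+)\] can be indexed: \[true\]")
HARD_FAILURE = "Got a hard failure when executing the bulk request"
TIMEOUT = "java.net.SocketTimeoutException"


def tally(lines):
    """Count candidate files, bulk hard failures, and socket timeouts."""
    files = set()
    failures = 0
    timeouts = 0
    for line in lines:
        match = CANDIDATE.search(line)
        if match:
            # Virtual path of a file FSCrawler decided to index.
            files.add(match.group("path"))
        if HARD_FAILURE in line:
            failures += 1
        if TIMEOUT in line:
            timeouts += 1
    return {"files": len(files), "hard_failures": failures, "timeouts": timeouts}


# Sample lines copied from the log excerpt in this post.
sample = [
    "DEBUG [f.p.e.c.f.FsParserAbstract] [/04/dejt-jud_26-04-2019.pdf] can be indexed: [true]",
    "DEBUG [f.p.e.c.f.FsParserAbstract] - file: /04/dejt-jud_26-04-2019.pdf",
    "WARN [f.p.e.c.f.c.v.ElasticsearchClientV7] Got a hard failure when executing the bulk request",
    "java.net.SocketTimeoutException: 30.000 milliseconds timeout on connection http-outgoing-6 [ACTIVE]",
]

print(tally(sample))
```

Passing an open log file instead of `sample` would work the same way, since the function only iterates over lines. Comparing `files` with `hard_failures` gives a rough idea of how many bulk requests time out per file the crawler attempts.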
I tried several different values for bulk_size, byte_size and flush_interval, but the result is the same.
I also looked for a way to increase the Elasticsearch connection timeout, but found nothing about it.
Have any of you seen anything like this before?
Thank you!!
--
Erick