Hello,
I'm trying to use FSCrawler 2.6 on a Windows Server machine to index a huge number of files at my company. It's a very large Windows folder tree on a network drive: 14.92 TB, 7.2M files in 2.3M folders. The data lives on a remote filer in the same data center.
The last run took 3 days and indexed 1,223,386 docs, but it was stopped by the following error:
00:01:23,406 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing disc-files-docs/_doc/d7f22c716648b86afd386e2ca91aeda?pipeline=null
00:01:23,406 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [V:\ISC\AA\XX\YY\ZZ\KK\HH\ACESSOS]...
00:01:24,146 WARN [f.p.e.c.f.FsParserAbstract] Error while crawling V:\ISC: Connection closed
00:01:24,146 WARN [f.p.e.c.f.FsParserAbstract] Full stacktrace
org.apache.http.ConnectionClosedException: Connection closed
at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:939) ~[elasticsearch-rest-client-6.5.3.jar:6.5.3]
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:229) ~[elasticsearch-rest-client-6.5.3.jar:6.5.3]
at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1593) ~[elasticsearch-rest-high-level-client-6.5.3.jar:6.5.3]
at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1563) ~[elasticsearch-rest-high-level-client-6.5.3.jar:6.5.3]
at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1525) ~[elasticsearch-rest-high-level-client-6.5.3.jar:6.5.3]
at org.elasticsearch.client.RestHighLevelClient.search(RestHighLevelClient.java:990) ~[elasticsearch-rest-high-level-client-6.5.3.jar:6.5.3]
at fr.pilato.elasticsearch.crawler.fs.client.v6.ElasticsearchClientV6.search(ElasticsearchClientV6.java:489) ~[fscrawler-elasticsearch-client-v6-2.6.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.getFileDirectory(FsParserAbstract.java:363) ~[fscrawler-core-2.6.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:317) ~[fscrawler-core-2.6.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:299) ~[fscrawler-core-2.6.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:299) ~[fscrawler-core-2.6.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:299) ~[fscrawler-core-2.6.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:299) ~[fscrawler-core-2.6.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:299) ~[fscrawler-core-2.6.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:299) ~[fscrawler-core-2.6.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:299) ~[fscrawler-core-2.6.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:157) [fscrawler-core-2.6.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_211]
Caused by: org.apache.http.ConnectionClosedException: Connection closed
at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.endOfInput(HttpAsyncRequestExecutor.java:344) ~[httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:261) ~[httpcore-nio-4.4.5.jar:4.4.5]
... 1 more
00:01:24,147 INFO [f.p.e.c.f.FsParserAbstract] FS crawler is stopping after 1 run
00:01:24,245 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [disc-files-dev]
00:01:24,245 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
00:01:24,246 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV6] Closing Elasticsearch client manager
00:01:24,305 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
00:01:24,306 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler [disc-files-dev] stopped
00:01:24,309 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [disc-files-dev]
00:01:24,312 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
00:01:24,312 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV6] Closing Elasticsearch client manager
00:01:24,313 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
00:01:24,313 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler [disc-files-dev] stopped
Other attempts were stopped by the same error at different points in the folder structure.
My _settings.json is the following:
{
  "name" : "disc-files-dev",
  "fs" : {
    "url" : "V:\\ISC",
    "update_rate" : "120h",
    "indexed_chars" : 5000,
    "includes": [
      "*/*.doc",
      "*/*.pdf",
      "*/*.csv",
      "*/*.docx",
      "*/*.ods",
      "*/*.odp",
      "*/*.odt",
      "*/*.pps",
      "*/*.ppsx",
      "*/*.ppt",
      "*/*.pptx",
      "*/*.rtf",
      "*/*.txt",
      "*/*.wps",
      "*/*.xls",
      "*/*.xlsx",
      "*/*.xlsm",
      "*/*.xps"
    ],
    "excludes": [
      "*/~*",
      "*/*.tmp",
      "*/*.eml",
      "*/*.jpg",
      "*/*.png",
      "*/ISC/XX*"
    ],
    "json_support" : false,
    "follow_symlink" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : true,
    "add_as_inner_object" : true,
    "store_source" : false,
    "index_content" : true,
    "attributes_support" : false,
    "raw_metadata" : true,
    "xml_support" : false,
    "index_folders" : true,
    "lang_detect" : false,
    "continue_on_error" : false,
    "ignore_above": "20mb",
    "pdf_ocr" : false,
    "ocr" : {
      "language" : "eng"
    }
  },
  "elasticsearch" : {
    "nodes" : [ {
      "url" : "https://elasticsearch-dsv.xxxxxx.com"
    } ],
    "bulk_size" : 100,
    "flush_interval" : "6s",
    "byte_size" : "5mb",
    "username" : "disc-all",
    "password" : "xxxxxxx",
    "index" : "disc-files-docs",
    "index_folder": "disc-files-folders"
  },
  "rest" : {
    "url" : "http://127.0.0.1:8080/fscrawler"
  }
}

(Note: I cleaned up the file while posting — the duplicate "*/*.doc" and "*/*.pdf" entries in "includes" and a trailing comma after the last "excludes" entry have been removed.)
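In the meantime, since `continue_on_error` is currently false, a single dropped connection aborts the whole multi-day run. I'm considering flipping it so the crawl at least survives transient errors — just the relevant fragment of `_settings.json`, everything else unchanged:

```json
{
  "fs" : {
    "continue_on_error" : true
  }
}
```

That wouldn't explain the disconnect itself, of course, only limit the damage from one failed request.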
I noticed this error always occurs right after a DEBUG line like this:
DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in...
There are no error logs on the Apache / Elasticsearch side.
Any help?
Thanks to all