FS Crawler exception "Error while crawling - Connection Closed"

Hello,

I'm trying to use FS Crawler 2.6 on a Windows Server machine to index a huge number of files at my company. It's a very large Windows folder tree on a network drive: 14.92 TB in size, 7.2M files across 2.3M folders. The data lives on a remote filer in the same data center.

The last execution ran for 3 days and indexed 1,223,386 docs, but was stopped by the following error:

00:01:23,406 DEBUG [f.p.e.c.f.FsParserAbstract] Indexing disc-files-docs/_doc/d7f22c716648b86afd386e2ca91aeda?pipeline=null
00:01:23,406 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [V:\ISC\AA\XX\YY\ZZ\KK\HH\ACESSOS]...
00:01:24,146 WARN  [f.p.e.c.f.FsParserAbstract] Error while crawling V:\ISC: Connection closed
00:01:24,146 WARN  [f.p.e.c.f.FsParserAbstract] Full stacktrace
org.apache.http.ConnectionClosedException: Connection closed
        at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:939) ~[elasticsearch-rest-client-6.5.3.jar:6.5.3]
        at org.elasticsearch.client.RestClient.performRequest(RestClient.java:229) ~[elasticsearch-rest-client-6.5.3.jar:6.5.3]
        at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1593) ~[elasticsearch-rest-high-level-client-6.5.3.jar:6.5.3]
        at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1563) ~[elasticsearch-rest-high-level-client-6.5.3.jar:6.5.3]
        at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1525) ~[elasticsearch-rest-high-level-client-6.5.3.jar:6.5.3]
        at org.elasticsearch.client.RestHighLevelClient.search(RestHighLevelClient.java:990) ~[elasticsearch-rest-high-level-client-6.5.3.jar:6.5.3]
        at fr.pilato.elasticsearch.crawler.fs.client.v6.ElasticsearchClientV6.search(ElasticsearchClientV6.java:489) ~[fscrawler-elasticsearch-client-v6-2.6.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.getFileDirectory(FsParserAbstract.java:363) ~[fscrawler-core-2.6.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:317) ~[fscrawler-core-2.6.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:299) ~[fscrawler-core-2.6.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:299) ~[fscrawler-core-2.6.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:299) ~[fscrawler-core-2.6.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:299) ~[fscrawler-core-2.6.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:299) ~[fscrawler-core-2.6.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:299) ~[fscrawler-core-2.6.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:299) ~[fscrawler-core-2.6.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:157) [fscrawler-core-2.6.jar:?]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_211]
Caused by: org.apache.http.ConnectionClosedException: Connection closed
        at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.endOfInput(HttpAsyncRequestExecutor.java:344) ~[httpcore-nio-4.4.5.jar:4.4.5]
        at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:261) ~[httpcore-nio-4.4.5.jar:4.4.5]
        ... 1 more
00:01:24,147 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler is stopping after 1 run
00:01:24,245 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [disc-files-dev]
00:01:24,245 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
00:01:24,246 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV6] Closing Elasticsearch client manager
00:01:24,305 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
00:01:24,306 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [disc-files-dev] stopped
00:01:24,309 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [disc-files-dev]
00:01:24,312 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
00:01:24,312 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV6] Closing Elasticsearch client manager
00:01:24,313 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
00:01:24,313 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [disc-files-dev] stopped

Other attempts were stopped by the same error at different points in the folder structure.

Here is my _settings.json:

{
  "name" : "disc-files-dev",
  "fs" : {
    "url" : "V:\\ISC",
    "update_rate" : "120h",
    "indexed_chars" : 5000,
    "includes" : [
      "*/*.csv",
      "*/*.doc",
      "*/*.docx",
      "*/*.ods",
      "*/*.odp",
      "*/*.odt",
      "*/*.pdf",
      "*/*.pps",
      "*/*.ppsx",
      "*/*.ppt",
      "*/*.pptx",
      "*/*.rtf",
      "*/*.txt",
      "*/*.wps",
      "*/*.xls",
      "*/*.xlsx",
      "*/*.xlsm",
      "*/*.xps"
    ],
    "excludes" : [
      "*/~*",
      "*/*.tmp",
      "*/*.eml",
      "*/*.jpg",
      "*/*.png",
      "*/ISC/XX*"
    ],
    "json_support" : false,
    "follow_symlink" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : true,
    "add_as_inner_object" : true,
    "store_source" : false,
    "index_content" : true,
    "attributes_support" : false,
    "raw_metadata" : true,
    "xml_support" : false,
    "index_folders" : true,
    "lang_detect" : false,
    "continue_on_error" : false,
    "ignore_above" : "20mb",
    "pdf_ocr" : false,
    "ocr" : {
      "language" : "eng"
    }
  },
  "elasticsearch" : {
    "nodes" : [ {
      "url" : "https://elasticsearch-dsv.xxxxxx.com"
    } ],
    "bulk_size" : 100,
    "flush_interval" : "6s",
    "byte_size" : "5mb",
    "username" : "disc-all",
    "password" : "xxxxxxx",
    "index" : "disc-files-docs",
    "index_folder" : "disc-files-folders"
  },
  "rest" : {
    "url" : "http://127.0.0.1:8080/fscrawler"
  }
}

I realized this error always occurs right after a DEBUG line like this one:

DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in...

There are no error logs on the Apache/Elasticsearch side.

Any help?

Thanks to all


w00t! I have never tested FSCrawler with such numbers. That's great to see! 🙂

So sadly something is going wrong between FSCrawler and Elasticsearch. For whatever reason, the connection is closed on the Elasticsearch side when FSCrawler sends its request.

If you don't want to "watch" the directory and detect files that have been removed, you can change remove_deleted to false. If you don't need all the raw metadata, I'd also change raw_metadata to false (that will be the default in the next release).
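In your _settings.json above, that means changing just these two lines in the fs section:

    "remove_deleted" : false,
    "raw_metadata" : false,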

That being said, I should definitely try to implement a retry mechanism at some point (i.e. recreate a new client in case of failure). Would you like to open an issue for this?
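For reference, here is a minimal, hypothetical Java sketch of that retry idea, written against the same 6.x RestHighLevelClient that appears in your stack traces. The RetryingSearchClient class and all of its names are purely illustrative, not FSCrawler code: on an IOException it closes the (possibly broken) client, rebuilds one from a factory, and retries a bounded number of times.

    import java.io.IOException;
    import java.util.function.Supplier;

    import org.apache.http.HttpHost;
    import org.elasticsearch.action.search.SearchRequest;
    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;

    // Illustrative sketch only -- not FSCrawler's actual implementation.
    public class RetryingSearchClient {

        private final Supplier<RestHighLevelClient> clientFactory;
        private final int maxRetries;
        private RestHighLevelClient client;

        public RetryingSearchClient(Supplier<RestHighLevelClient> clientFactory, int maxRetries) {
            this.clientFactory = clientFactory;
            this.maxRetries = maxRetries;
            this.client = clientFactory.get();
        }

        public SearchResponse search(SearchRequest request) throws IOException {
            IOException lastFailure = null;
            for (int attempt = 0; attempt <= maxRetries; attempt++) {
                try {
                    return client.search(request, RequestOptions.DEFAULT);
                } catch (IOException e) {
                    lastFailure = e;
                    // The connection may be dead: drop this client and build a fresh one.
                    try { client.close(); } catch (IOException ignored) { }
                    client = clientFactory.get();
                }
            }
            throw lastFailure;
        }
    }

    // Usage, e.g. against a local node:
    // RetryingSearchClient retrying = new RetryingSearchClient(
    //         () -> new RestHighLevelClient(
    //                 RestClient.builder(new HttpHost("localhost", 9200, "http"))),
    //         3);
    // SearchResponse response = retrying.search(new SearchRequest("disc-files-docs"));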

Side note: so many improvements have been made in the SNAPSHOT version (2.7) that I'd encourage you to use it. I know that I need to release it eventually...

Thanks David,

Running FS Crawler in trace mode gives more info:

19:50:41,934 DEBUG [f.p.e.c.f.FsParserAbstract] Looking for removed files in [V:\ISC\XXX\YYY\ZZZ\KKK\MMM\FFF]...
19:50:41,934 TRACE [f.p.e.c.f.FsParserAbstract] Querying elasticsearch for files in dir [path.root:edf493b81409d494ce8dde6d1cb2cca]
19:50:42,242 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV6] Executed bulk request with [89] requests
19:50:43,935 WARN  [f.p.e.c.f.FsParserAbstract] Error while crawling V:\ISC: null
19:50:43,935 WARN  [f.p.e.c.f.FsParserAbstract] Full stacktrace
java.net.ConnectException: null
        at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:949) ~[elasticsearch-rest-client-6.5.3.jar:6.5.3]
        at org.elasticsearch.client.RestClient.performRequest(RestClient.java:229) ~[elasticsearch-rest-client-6.5.3.jar:6.5.3]
        at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1593) ~[elasticsearch-rest-high-level-client-6.5.3.jar:6.5.3]
        at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1563) ~[elasticsearch-rest-high-level-client-6.5.3.jar:6.5.3]
        at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1525) ~[elasticsearch-rest-high-level-client-6.5.3.jar:6.5.3]
        at org.elasticsearch.client.RestHighLevelClient.search(RestHighLevelClient.java:990) ~[elasticsearch-rest-high-level-client-6.5.3.jar:6.5.3]
        at fr.pilato.elasticsearch.crawler.fs.client.v6.ElasticsearchClientV6.search(ElasticsearchClientV6.java:489) ~[fscrawler-elasticsearch-client-v6-2.6.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.getFileDirectory(FsParserAbstract.java:363) ~[fscrawler-core-2.6.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:317) ~[fscrawler-core-2.6.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:299) ~[fscrawler-core-2.6.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:299) ~[fscrawler-core-2.6.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:299) ~[fscrawler-core-2.6.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:299) ~[fscrawler-core-2.6.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:299) ~[fscrawler-core-2.6.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:299) ~[fscrawler-core-2.6.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:299) ~[fscrawler-core-2.6.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:299) ~[fscrawler-core-2.6.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:299) ~[fscrawler-core-2.6.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:299) ~[fscrawler-core-2.6.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:157) [fscrawler-core-2.6.jar:?]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_211]
Caused by: java.net.ConnectException
        at org.apache.http.nio.pool.RouteSpecificPool.timeout(RouteSpecificPool.java:168) ~[httpcore-nio-4.4.5.jar:4.4.5]
        at org.apache.http.nio.pool.AbstractNIOConnPool.requestTimeout(AbstractNIOConnPool.java:561) ~[httpcore-nio-4.4.5.jar:4.4.5]
        at org.apache.http.nio.pool.AbstractNIOConnPool$InternalSessionRequestCallback.timeout(AbstractNIOConnPool.java:822) ~[httpcore-nio-4.4.5.jar:4.4.5]
        at org.apache.http.impl.nio.reactor.SessionRequestImpl.timeout(SessionRequestImpl.java:183) ~[httpcore-nio-4.4.5.jar:4.4.5]
        at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processTimeouts(DefaultConnectingIOReactor.java:210) ~[httpcore-nio-4.4.5.jar:4.4.5]
        at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:155) ~[httpcore-nio-4.4.5.jar:4.4.5]
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:348) ~[httpcore-nio-4.4.5.jar:4.4.5]
        at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:192) ~[httpasyncclient-4.1.2.jar:4.1.2]
        at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64) ~[httpasyncclient-4.1.2.jar:4.1.2]
        ... 1 more
19:50:43,937 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler is stopping after 1 run
19:50:43,999 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [disc-files-dev]
19:50:43,999 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
19:50:44,001 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV6] Closing Elasticsearch client manager
19:50:44,001 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV6] Sending a bulk request of [22] requests
19:50:44,052 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV6] Executed bulk request with [22] requests
19:50:44,055 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
19:50:44,055 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [disc-files-dev] stopped
19:50:44,058 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [disc-files-dev]
19:50:44,058 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
19:50:44,058 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV6] Closing Elasticsearch client manager
19:50:44,058 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
19:50:44,058 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [disc-files-dev] stopped

I'll open an issue and wait for the next FS Crawler version. I think the retry mechanism is really important in this case.

Thank you

Does it mean that you can reproduce it on every run?
If so, do you think it's possible to try the latest 2.7 version?

It's not easy to reproduce. I've run it many times, but the error occurs at different moments in the process, with different files and folders. For example, the last run (in trace mode) took 1 day before being stopped by the Connection Closed error. The previous one took 3.5 days before being stopped by the same error at another point in the folder tree.

I'll check with the infra team whether I can try the 2.7 version.
