Error while indexing documents into ES using Fscrawler


(Jasmeet) #1

Hi, I am using FSCrawler to index a large set of documents kept in various folders. I have created separate jobs for each of the major folders and run them one by one. Some of the folders are quite large (>180 GB) and also contain subfolders, for which creating individual jobs is a very cumbersome process. On one such folder, FSCrawler ran for an entire day and then failed with the error reproduced below. Can someone please explain why this error occurs and how to resolve it?
2. I had run the crawler on the same folder earlier and also got an error. Because no status.json file is created when the crawler exits with an error, FSCrawler tries to reindex every document/folder from the beginning each time it is restarted.
Thanks
JS
// ERROR

16:37:08,809 DEBUG [f.p.e.c.f.c.FileAbstractor] Listing local files from E:\MISC FILES\IT Attendance System\Record of Attendance\IT attendance\2018\December 2018
16:37:08,809 DEBUG [f.p.e.c.f.c.FileAbstractor] 0 local files found
16:37:08,809 DEBUG [f.p.e.c.f.FsParser] Looking for removed files in [E:\MISC FILES\IT Attendance System\Record of Attendance\IT attendance\2018\December 2018]...
16:37:08,809 TRACE [f.p.e.c.f.FsParser] Querying elasticsearch for files in dir [path.root:4fcf95a09d4a7831e5910733dbaa]
16:37:12,982 TRACE [f.p.e.c.f.c.ElasticsearchClientManager] Sending a bulk request of [1] requests
16:37:13,581 TRACE [f.p.e.c.f.c.ElasticsearchClientManager] Sending a bulk request of [1] requests
16:37:34,027 WARN [f.p.e.c.f.c.ElasticsearchClientManager] Got a hard failure when executing the bulk request
java.net.SocketException: null
at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:375) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92) [httpasyncclient-4.1.2.jar:4.1.2]
at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39) [httpasyncclient-4.1.2.jar:4.1.2]
at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:263) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:492) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:213) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588) [httpcore-nio-4.4.5.jar:4.4.5]
at java.lang.Thread.run(Unknown Source) [?:1.8.0_171]
16:37:38,927 WARN [f.p.e.c.f.FsParser] Error while crawling E:\MISC FILES: listener timeout after waiting for [30000] ms
16:37:38,999 WARN [f.p.e.c.f.FsParser] Full stacktrace
java.io.IOException: listener timeout after waiting for [30000] ms
at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:684) ~[elasticsearch-rest-client-6.3.2.jar:6.3.2]
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235) ~[elasticsearch-rest-client-6.3.2.jar:6.3.2]
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:198) ~[elasticsearch-rest-client-6.3.2.jar:6.3.2]
at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:522) ~[elasticsearch-rest-high-level-client-6.3.2.jar:6.3.2]
at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:508) ~[elasticsearch-rest-high-level-client-6.3.2.jar:6.3.2]
at org.elasticsearch.client.RestHighLevelClient.search(RestHighLevelClient.java:404) ~[elasticsearch-rest-high-level-client-6.3.2.jar:6.3.2]
at fr.pilato.elasticsearch.crawler.fs.FsParser.getFileDirectory(FsParser.java:356) ~[fscrawler-core-2.5.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParser.addFilesRecursively(FsParser.java:307) ~[fscrawler-core-2.5.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParser.addFilesRecursively(FsParser.java:290) ~[fscrawler-core-2.5.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParser.addFilesRecursively(FsParser.java:290) ~[fscrawler-core-2.5.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParser.addFilesRecursively(FsParser.java:290) ~[fscrawler-core-2.5.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParser.addFilesRecursively(FsParser.java:290) ~[fscrawler-core-2.5.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParser.addFilesRecursively(FsParser.java:290) ~[fscrawler-core-2.5.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParser.run(FsParser.java:167) [fscrawler-core-2.5.jar:?]
at java.lang.Thread.run(Unknown Source) [?:1.8.0_171]
16:37:39,139 INFO [f.p.e.c.f.FsParser] FS crawler is stopping after 1 run
16:37:39,428 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [e_drive_misc]
16:37:39,505 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
16:37:39,505 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] Closing Elasticsearch client manager
16:37:44,226 WARN [f.p.e.c.f.c.ElasticsearchClientManager] Got a hard failure when executing the bulk request
java.net.SocketTimeoutException: null
at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:375) [httpcore-nio-4.4.5.jar:4.4.5]
at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92) [httpasyncclient-4.1.2.jar:4.1.2]
at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39) [httpasyncclient-4.1.2.jar:4.1.2]

16:38:02,138 TRACE [f.p.e.c.f.c.ElasticsearchClientManager] Executed bulk request with [1] requests
16:38:02,328 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing REST client
16:38:02,629 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
16:38:02,629 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler [e_drive_misc] stopped
16:38:03,321 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [e_drive_misc]
16:38:03,422 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
16:38:03,422 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] Closing Elasticsearch client manager
16:38:03,422 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing REST client
16:38:03,422 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
16:38:03,422 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler [e_drive_misc] stopped
//


(David Pilato) #2

What are your FSCrawler settings for this job?

Please format your code, logs, or configuration files using the </> icon, as explained in this guide, and not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

This is the icon to use if you are not using markdown format:

There's a live preview panel for exactly this reason.


(Jasmeet) #3

Hi, sorry for the delay in responding. My settings are given below:


{
  "name" : "test",
  "fs" : {
    "url" : "E:\\MISC FILES",
    "update_rate" : "5m",
    "excludes" : [ "*/~*" ],
    "json_support" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : true,
    "add_as_inner_object" : false,
    "store_source" : false,
    "index_content" : true,
    "attributes_support" : false,
    "raw_metadata" : true,
    "xml_support" : false,
    "index_folders" : true,
    "lang_detect" : false,
    "continue_on_error" : false,
    "pdf_ocr" : true,
    "ocr" : {
      "language" : "eng"
    }
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "127.0.0.1",
      "port" : 9200,
      "scheme" : "HTTP"
    } ],
    "bulk_size" : 100,
    "flush_interval" : "5s",
    "byte_size" : "10mb"
  },
  "rest" : {
    "scheme" : "HTTP",
    "host" : "127.0.0.1",
    "port" : 8080,
    "endpoint" : "fscrawler"
  }
}
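A hedged aside on the error in the log above: `listener timeout after waiting for [30000] ms` suggests that individual bulk requests (here up to `bulk_size: 100` documents or `byte_size: 10mb`) take longer than the REST client's 30 s limit, which is plausible with OCR-heavy PDFs producing large payloads. One possible mitigation, a sketch rather than a confirmed fix, is to shrink each bulk request in the `elasticsearch` section of the job file so it completes faster:

```json
"elasticsearch" : {
  "nodes" : [ { "host" : "127.0.0.1", "port" : 9200, "scheme" : "HTTP" } ],
  "bulk_size" : 20,
  "flush_interval" : "5s",
  "byte_size" : "4mb"
}
```

Smaller bulks mean more round trips, but each request stays well under the timeout. The values shown are illustrative, not tuned.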

(David Pilato) #4

Could you please format your JSON correctly so that it is readable?


(Jasmeet) #5

I am actually doing this from a mobile phone, as I am traveling and don't have access to a computer. I tried to format using </>.

{
  "name" : "test",
  "fs" : {
    "url" : "C:\Users\Sun\Documents\test",
    "update_rate" : "5m",
    "excludes" : [ "*/~*" ],
    "json_support" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : true,
    "add_as_inner_object" : false,
    "store_source" : false,
    "index_content" : true,
    "attributes_support" : false,
    "raw_metadata" : true,
    "xml_support" : false,
    "index_folders" : true,
    "lang_detect" : false,
    "continue_on_error" : false,
    "pdf_ocr" : true,
    "ocr" : {
      "language" : "eng"
    }
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "127.0.0.1",
      "port" : 9200,
      "scheme" : "HTTP"
    } ],
    "bulk_size" : 100,
    "flush_interval" : "5s",
    "byte_size" : "10mb"
  },
  "rest" : {
    "scheme" : "HTTP",
    "host" : "127.0.0.1",
    "port" : 8080,
    "endpoint" : "fscrawler"
  }
}
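One detail worth flagging in any such settings file (a general JSON rule, not specific to FSCrawler): a backslash inside a JSON string is an escape character, so Windows paths must be written with doubled backslashes or the file will fail to parse. For example, for a path like the one above:

```json
{
  "fs" : {
    "url" : "C:\\Users\\Sun\\Documents\\test"
  }
}
```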


(David Pilato) #6

Please format your code, logs, or configuration files using the </> icon, as explained in this guide, and not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

This is the icon to use if you are not using markdown format:

There's a live preview panel for exactly this reason.

Lots of people read these forums, and many of them will simply skip over a post that is difficult to read, because it's just too large an investment of their time to try and follow a wall of badly formatted text.
If your goal is to get an answer to your questions, it's in your interest to make it as easy to read and understand as possible.
Please update your post.


(system) #7

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.