Errors while indexing a mounted drive using FSCrawler

Hi,

I have mounted a SharePoint site to a network drive (/mnt/sp) on CentOS.
I am then indexing the mounted files using FSCrawler.
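For reference, the mount looks roughly like this (a minimal sketch assuming a davfs2/WebDAV mount with a placeholder site URL; the actual mount method and options may differ):

```
# Install the WebDAV filesystem client (davfs2 comes from the EPEL repository)
sudo yum install -y epel-release davfs2

# Mount the SharePoint document library at /mnt/sp
# (hypothetical URL -- replace with the real site/library path)
sudo mkdir -p /mnt/sp
sudo mount -t davfs https://example.sharepoint.com/sites/mysite /mnt/sp
```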
Here is my settings file:

```
---
name: "index_45.79.189.33"
fs:
  url: "/mnt/sp/fsSharepointFiles"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://50.116.48.89:8881"
  username: ls
  password: lspass
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
```

When I create my index directory and index for the first time, everything works fine.
But when I reindex using --restart, the errors below appear:

```
[ls@li1288-33 fscrawler-es7-2.7-SNAPSHOT]$ bin/fscrawler --config_dir test_dir_45.79.189.33 index_45.79.189.33 --loop 1 --restart
03:03:47,737 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [61.2mb/843mb=7.27%], RAM [1.8gb/3.7gb=49.26%], Swap [511.9mb/511.9mb=100.0%].
03:03:49,000 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.5.1
03:03:49,088 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
03:03:49,448 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [index_45.79.189.33] for [/mnt/sp/fsSharepointFiles] every [15m]
03:04:39,529 WARN  [f.p.e.c.f.FsParserAbstract] Error while crawling /mnt/sp/fsSharepointFiles: /mnt/sp/fsSharepointFiles/fsSharepointfile1.txt (Resource temporarily unavailable)
03:04:39,529 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler is stopping after 1 run
03:04:54,121 WARN  [f.p.e.c.f.c.v.ElasticsearchClientV7] Got a hard failure when executing the bulk request
java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-0 [ACTIVE]
        at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387) [httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92) [httpasyncclient-4.1.4.jar:4.1.4]
        at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39) [httpasyncclient-4.1.4.jar:4.1.4]
        at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175) [httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:261) [httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:502) [httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:211) [httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280) [httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) [httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591) [httpcore-nio-4.4.13.jar:4.4.13]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_242]
03:04:54,134 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [index_45.79.189.33] stopped
03:04:54,138 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [index_45.79.189.33] stopped
[ls@li1288-33 fscrawler-es7-2.7-SNAPSHOT]$ ^C
[ls@li1288-33 fscrawler-es7-2.7-SNAPSHOT]$ bin/fscrawler --config_dir test_dir_45.79.189.33 index_45.79.189.33 --loop 1 --restart
03:11:35,231 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [61.3mb/843mb=7.28%], RAM [1.8gb/3.7gb=49.24%], Swap [511.9mb/511.9mb=100.0%].
03:11:37,200 WARN  [f.p.e.c.f.c.v.ElasticsearchClientV7] failed to create elasticsearch client, disabling crawler...
03:11:37,200 FATAL [f.p.e.c.f.c.FsCrawlerCli] We can not start Elasticsearch Client. Exiting.
java.net.ConnectException: Timeout connecting to [/50.116.48.89:8881]
        at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:823) ~[elasticsearch-rest-client-7.6.2.jar:7.6.2]
        at org.elasticsearch.client.RestClient.performRequest(RestClient.java:248) ~[elasticsearch-rest-client-7.6.2.jar:7.6.2]
        at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235) ~[elasticsearch-rest-client-7.6.2.jar:7.6.2]
        at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1514) ~[elasticsearch-rest-high-level-client-7.6.2.jar:7.6.2]
        at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1499) ~[elasticsearch-rest-high-level-client-7.6.2.jar:7.6.2]
        at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1466) ~[elasticsearch-rest-high-level-client-7.6.2.jar:7.6.2]
        at org.elasticsearch.client.RestHighLevelClient.info(RestHighLevelClient.java:730) ~[elasticsearch-rest-high-level-client-7.6.2.jar:7.6.2]
        at fr.pilato.elasticsearch.crawler.fs.client.v7.ElasticsearchClientV7.getVersion(ElasticsearchClientV7.java:169) ~[fscrawler-elasticsearch-client-v7-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.client.ElasticsearchClient.checkVersion(ElasticsearchClient.java:181) ~[fscrawler-elasticsearch-client-base-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.client.v7.ElasticsearchClientV7.start(ElasticsearchClientV7.java:142) ~[fscrawler-elasticsearch-client-v7-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.cli.FsCrawlerCli.main(FsCrawlerCli.java:257) [fscrawler-cli-2.7-SNAPSHOT.jar:?]
Caused by: java.net.ConnectException: Timeout connecting to [/50.116.48.89:8881]
        at org.apache.http.nio.pool.RouteSpecificPool.timeout(RouteSpecificPool.java:169) ~[httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.nio.pool.AbstractNIOConnPool.requestTimeout(AbstractNIOConnPool.java:632) ~[httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.nio.pool.AbstractNIOConnPool$InternalSessionRequestCallback.timeout(AbstractNIOConnPool.java:898) ~[httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.reactor.SessionRequestImpl.timeout(SessionRequestImpl.java:198) ~[httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processTimeouts(DefaultConnectingIOReactor.java:213) ~[httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:158) ~[httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:351) ~[httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:221) ~[httpasyncclient-4.1.4.jar:4.1.4]
        at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64) ~[httpasyncclient-4.1.4.jar:4.1.4]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_242]
03:11:37,215 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [index_45.79.189.33] stopped
03:11:37,216 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [index_45.79.189.33] stopped

```

Again, if I delete the index directory and start indexing from scratch, no errors appear:

```
[ls@li1288-33 fscrawler-es7-2.7-SNAPSHOT]$ rm -rf test_dir_45.79.189.33/
[ls@li1288-33 fscrawler-es7-2.7-SNAPSHOT]$ ls
bin  lib  LICENSE  NOTICE  README.md  test_dir_sp_linux
[ls@li1288-33 fscrawler-es7-2.7-SNAPSHOT]$ bin/fscrawler --config_dir test_dir_45.79.189.33 index_45.79.189.33 --loop 1
03:25:16,631 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [61.3mb/843mb=7.28%], RAM [1.8gb/3.7gb=49.21%], Swap [511.9mb/511.9mb=100.0%].
03:25:16,650 WARN  [f.p.e.c.f.c.FsCrawlerCli] job [index_45.79.189.33] does not exist
03:25:16,651 INFO  [f.p.e.c.f.c.FsCrawlerCli] Do you want to create it (Y/N)?
y
03:25:19,614 INFO  [f.p.e.c.f.c.FsCrawlerCli] Settings have been created in [test_dir_45.79.189.33/index_45.79.189.33/_settings.yaml]. Please review and edit before relaunch
[ls@li1288-33 fscrawler-es7-2.7-SNAPSHOT]$ vim test_dir_45.79.189.33/index_45.79.189.33/_settings.yaml
[ls@li1288-33 fscrawler-es7-2.7-SNAPSHOT]$ bin/fscrawler --config_dir test_dir_45.79.189.33 index_45.79.189.33 --loop 1 --restart
03:28:57,555 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [61.2mb/843mb=7.27%], RAM [1.8gb/3.7gb=49.25%], Swap [511.9mb/511.9mb=100.0%].
03:28:58,727 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.5.1
03:28:58,841 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
03:28:59,209 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [index_45.79.189.33] for [/mnt/sp/fsSharepointFiles] every [15m]
03:28:59,791 WARN  [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

03:29:00,745 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler is stopping after 1 run
03:29:00,914 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [index_45.79.189.33] stopped
```

Could you please tell me why re-indexing is not working properly?

-Lisa

Weird.
Is there anything in the Elasticsearch logs?

Could you try decreasing the bulk size to 10 to see if things get better?
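For example, in the elasticsearch section of _settings.yaml (the same settings as above, with only bulk_size changed):

```
elasticsearch:
  nodes:
  - url: "http://50.116.48.89:8881"
  username: ls
  password: lspass
  bulk_size: 10
  flush_interval: "5s"
  byte_size: "10mb"
```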

Hi David,

I tried with bulk_size = 5, but the same thing is happening:

```
[ls@li1288-33 fscrawler-es7-2.7-SNAPSHOT]$ bin/fscrawler --config_dir --debug test_dir_45.79.189.33 index_45.79.189.33 --loop 1 --restart
03:56:18,992 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [61.7mb/843mb=7.32%], RAM [1.8gb/3.7gb=49.35%], Swap [511.9mb/511.9mb=100.0%].
03:56:20,127 WARN  [f.p.e.c.f.c.v.ElasticsearchClientV7] failed to create elasticsearch client, disabling crawler...
03:56:20,127 FATAL [f.p.e.c.f.c.FsCrawlerCli] We can not start Elasticsearch Client. Exiting.
java.net.ConnectException: Connection refused
        at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:823) ~[elasticsearch-rest-client-7.6.2.jar:7.6.2]
        at org.elasticsearch.client.RestClient.performRequest(RestClient.java:248) ~[elasticsearch-rest-client-7.6.2.jar:7.6.2]
        at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235) ~[elasticsearch-rest-client-7.6.2.jar:7.6.2]
        at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1514) ~[elasticsearch-rest-high-level-client-7.6.2.jar:7.6.2]
        at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1499) ~[elasticsearch-rest-high-level-client-7.6.2.jar:7.6.2]
        at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1466) ~[elasticsearch-rest-high-level-client-7.6.2.jar:7.6.2]
        at org.elasticsearch.client.RestHighLevelClient.info(RestHighLevelClient.java:730) ~[elasticsearch-rest-high-level-client-7.6.2.jar:7.6.2]
        at fr.pilato.elasticsearch.crawler.fs.client.v7.ElasticsearchClientV7.getVersion(ElasticsearchClientV7.java:169) ~[fscrawler-elasticsearch-client-v7-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.client.ElasticsearchClient.checkVersion(ElasticsearchClient.java:181) ~[fscrawler-elasticsearch-client-base-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.client.v7.ElasticsearchClientV7.start(ElasticsearchClientV7.java:142) ~[fscrawler-elasticsearch-client-v7-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.cli.FsCrawlerCli.main(FsCrawlerCli.java:257) [fscrawler-cli-2.7-SNAPSHOT.jar:?]
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_242]
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714) ~[?:1.8.0_242]
        at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvent(DefaultConnectingIOReactor.java:174) ~[httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:148) ~[httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:351) ~[httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:221) ~[httpasyncclient-4.1.4.jar:4.1.4]
        at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64) ~[httpasyncclient-4.1.4.jar:4.1.4]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_242]
03:56:20,134 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [test_dir_45.79.189.33] stopped
03:56:20,136 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [test_dir_45.79.189.33] stopped
```

I am not sure about my Elasticsearch logs. Are they supposed to be in /var/log/elasticsearch?
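(If in doubt, the log directory can be confirmed from the server's own config; a quick check, assuming a default package install under /etc/elasticsearch:)

```
# path.logs in elasticsearch.yml sets where the server writes its logs;
# RPM/DEB installs default to /var/log/elasticsearch
grep path.logs /etc/elasticsearch/elasticsearch.yml
```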
Here is what I found in /var/log/elasticsearch/elasticsearch.log:

```
[2020-04-16T01:30:00,001][INFO ][o.e.x.m.MlDailyMaintenanceService] [li393-89.members.linode.com] triggering scheduled [ML] maintenance tasks
[2020-04-16T01:30:00,011][INFO ][o.e.x.m.a.TransportDeleteExpiredDataAction] [li393-89.members.linode.com] Deleting expired data
[2020-04-16T01:30:00,013][INFO ][o.e.x.s.SnapshotRetentionTask] [li393-89.members.linode.com] starting SLM retention snapshot cleanup task
[2020-04-16T01:30:00,019][INFO ][o.e.x.m.a.TransportDeleteExpiredDataAction] [li393-89.members.linode.com] Completed deletion of expired ML data
[2020-04-16T01:30:00,020][INFO ][o.e.x.m.MlDailyMaintenanceService] [li393-89.members.linode.com] Successfully completed [ML] maintenance tasks
[2020-04-16T01:56:11,290][INFO ][o.e.c.m.MetaDataCreateIndexService] [li393-89.members.linode.com] [index_45.79.189.33] creating index, cause [api], templates [], shards [1]/[1], mappings [_doc]
[2020-04-16T01:56:11,454][INFO ][o.e.c.m.MetaDataCreateIndexService] [li393-89.members.linode.com] [index_45.79.189.33_folder] creating index, cause [api], templates [], shards [1]/[1], mappings [_doc]
[2020-04-16T03:01:34,106][INFO ][o.e.c.m.MetaDataDeleteIndexService] [li393-89.members.linode.com] [index_45.79.189.33/4dnt8QBwRZW0-Varnd_vlA] deleting index
[2020-04-16T03:01:53,659][INFO ][o.e.c.m.MetaDataCreateIndexService] [li393-89.members.linode.com] [index_45.79.189.33] creating index, cause [api], templates [], shards [1]/[1], mappings [_doc]
[2020-04-16T03:28:25,262][INFO ][o.e.c.m.MetaDataMappingService] [li393-89.members.linode.com] [index_45.79.189.33/vQqoPbJ0QPW75O9zYiy2PA] update_mapping [_doc]
[2020-04-16T03:41:36,694][INFO ][o.e.c.m.MetaDataDeleteIndexService] [li393-89.members.linode.com] [index_45.79.189.33/vQqoPbJ0QPW75O9zYiy2PA] deleting index
[2020-04-16T03:42:03,165][INFO ][o.e.c.m.MetaDataCreateIndexService] [li393-89.members.linode.com] [index_45.79.189.33] creating index, cause [api], templates [], shards [1]/[1], mappings [_doc]
```

That's not the same error. Now FSCrawler cannot start at all, as it can't connect to the Elasticsearch server. Is Elasticsearch still running?
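A quick way to check from the machine running FSCrawler (a sketch using the host, port, and credentials from your settings; adjust the service name if Elasticsearch is not installed as a systemd service):

```
# On the Elasticsearch host: is the service up?
sudo systemctl status elasticsearch

# From the FSCrawler machine: can we reach the node at all?
curl -u ls:lspass http://50.116.48.89:8881
```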

Hi David,

I checked the Elasticsearch service; it's running!
The server where I am running FSCrawler is connected through a VPN to the SharePoint VM. Sometimes when I ping my Elasticsearch server it is not reachable, which I think is due to the VPN. So I just stopped the VPN while indexing, and that worked!!
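In case it helps someone else, this is the kind of pre-flight check I now run before starting the crawler (hosts and paths are the ones from this thread):

```
# Confirm the Elasticsearch host is reachable (it was intermittently
# unreachable while the VPN was up)
ping -c 3 50.116.48.89

# Confirm the SharePoint mount still answers before crawling
stat /mnt/sp/fsSharepointFiles
```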

Thanks,
Lisa
