Errors while indexing a mounted drive using FSCrawler

Hi,

I have mounted a SharePoint site to a network drive (/mnt/sp) on CentOS.
I am then indexing the mounted files using FSCrawler.
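For reference, the mount looks roughly like this (a minimal sketch assuming a davfs2/WebDAV mount with a placeholder site URL; the actual mount method and options may differ):

```
# Install the WebDAV filesystem client (davfs2 comes from the EPEL repository)
sudo yum install -y epel-release davfs2

# Mount the SharePoint document library at /mnt/sp
# (hypothetical URL -- replace with the real site/library path)
sudo mkdir -p /mnt/sp
sudo mount -t davfs https://example.sharepoint.com/sites/mysite /mnt/sp
```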
Here is my settings file:

```
---
name: "index_45.79.189.33"
fs:
  url: "/mnt/sp/fsSharepointFiles"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://50.116.48.89:8881"
  username: ls
  password: lspass
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
```

When I create my index directory and index for the first time, everything works fine.
But when I reindex using --restart, the errors below appear:

```
[ls@li1288-33 fscrawler-es7-2.7-SNAPSHOT]$ bin/fscrawler --config_dir test_dir_45.79.189.33 index_45.79.189.33 --loop 1 --restart
03:03:47,737 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [61.2mb/843mb=7.27%], RAM [1.8gb/3.7gb=49.26%], Swap [511.9mb/511.9mb=100.0%].
03:03:49,000 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.5.1
03:03:49,088 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
03:03:49,448 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [index_45.79.189.33] for [/mnt/sp/fsSharepointFiles] every [15m]
03:04:39,529 WARN  [f.p.e.c.f.FsParserAbstract] Error while crawling /mnt/sp/fsSharepointFiles: /mnt/sp/fsSharepointFiles/fsSharepointfile1.txt (Resource temporarily unavailable)
03:04:39,529 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler is stopping after 1 run
03:04:54,121 WARN  [f.p.e.c.f.c.v.ElasticsearchClientV7] Got a hard failure when executing the bulk request
java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-0 [ACTIVE]
        at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387) [httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92) [httpasyncclient-4.1.4.jar:4.1.4]
        at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39) [httpasyncclient-4.1.4.jar:4.1.4]
        at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175) [httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:261) [httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:502) [httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:211) [httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280) [httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) [httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591) [httpcore-nio-4.4.13.jar:4.4.13]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_242]
03:04:54,134 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [index_45.79.189.33] stopped
03:04:54,138 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [index_45.79.189.33] stopped
[ls@li1288-33 fscrawler-es7-2.7-SNAPSHOT]$ ^C
[ls@li1288-33 fscrawler-es7-2.7-SNAPSHOT]$ bin/fscrawler --config_dir test_dir_45.79.189.33 index_45.79.189.33 --loop 1 --restart
03:11:35,231 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [61.3mb/843mb=7.28%], RAM [1.8gb/3.7gb=49.24%], Swap [511.9mb/511.9mb=100.0%].
03:11:37,200 WARN  [f.p.e.c.f.c.v.ElasticsearchClientV7] failed to create elasticsearch client, disabling crawler...
03:11:37,200 FATAL [f.p.e.c.f.c.FsCrawlerCli] We can not start Elasticsearch Client. Exiting.
java.net.ConnectException: Timeout connecting to [/50.116.48.89:8881]
        at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:823) ~[elasticsearch-rest-client-7.6.2.jar:7.6.2]
        at org.elasticsearch.client.RestClient.performRequest(RestClient.java:248) ~[elasticsearch-rest-client-7.6.2.jar:7.6.2]
        at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235) ~[elasticsearch-rest-client-7.6.2.jar:7.6.2]
        at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1514) ~[elasticsearch-rest-high-level-client-7.6.2.jar:7.6.2]
        at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1499) ~[elasticsearch-rest-high-level-client-7.6.2.jar:7.6.2]
        at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1466) ~[elasticsearch-rest-high-level-client-7.6.2.jar:7.6.2]
        at org.elasticsearch.client.RestHighLevelClient.info(RestHighLevelClient.java:730) ~[elasticsearch-rest-high-level-client-7.6.2.jar:7.6.2]
        at fr.pilato.elasticsearch.crawler.fs.client.v7.ElasticsearchClientV7.getVersion(ElasticsearchClientV7.java:169) ~[fscrawler-elasticsearch-client-v7-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.client.ElasticsearchClient.checkVersion(ElasticsearchClient.java:181) ~[fscrawler-elasticsearch-client-base-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.client.v7.ElasticsearchClientV7.start(ElasticsearchClientV7.java:142) ~[fscrawler-elasticsearch-client-v7-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.cli.FsCrawlerCli.main(FsCrawlerCli.java:257) [fscrawler-cli-2.7-SNAPSHOT.jar:?]
Caused by: java.net.ConnectException: Timeout connecting to [/50.116.48.89:8881]
        at org.apache.http.nio.pool.RouteSpecificPool.timeout(RouteSpecificPool.java:169) ~[httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.nio.pool.AbstractNIOConnPool.requestTimeout(AbstractNIOConnPool.java:632) ~[httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.nio.pool.AbstractNIOConnPool$InternalSessionRequestCallback.timeout(AbstractNIOConnPool.java:898) ~[httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.reactor.SessionRequestImpl.timeout(SessionRequestImpl.java:198) ~[httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processTimeouts(DefaultConnectingIOReactor.java:213) ~[httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:158) ~[httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:351) ~[httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:221) ~[httpasyncclient-4.1.4.jar:4.1.4]
        at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64) ~[httpasyncclient-4.1.4.jar:4.1.4]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_242]
03:11:37,215 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [index_45.79.189.33] stopped
03:11:37,216 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [index_45.79.189.33] stopped

```

Again, if I delete the index directory and start indexing from scratch, no errors appear:

```
[ls@li1288-33 fscrawler-es7-2.7-SNAPSHOT]$ rm -rf test_dir_45.79.189.33/
[ls@li1288-33 fscrawler-es7-2.7-SNAPSHOT]$ ls
bin  lib  LICENSE  NOTICE  README.md  test_dir_sp_linux
[ls@li1288-33 fscrawler-es7-2.7-SNAPSHOT]$ bin/fscrawler --config_dir test_dir_45.79.189.33 index_45.79.189.33 --loop 1
03:25:16,631 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [61.3mb/843mb=7.28%], RAM [1.8gb/3.7gb=49.21%], Swap [511.9mb/511.9mb=100.0%].
03:25:16,650 WARN  [f.p.e.c.f.c.FsCrawlerCli] job [index_45.79.189.33] does not exist
03:25:16,651 INFO  [f.p.e.c.f.c.FsCrawlerCli] Do you want to create it (Y/N)?
y
03:25:19,614 INFO  [f.p.e.c.f.c.FsCrawlerCli] Settings have been created in [test_dir_45.79.189.33/index_45.79.189.33/_settings.yaml]. Please review and edit before relaunch
[ls@li1288-33 fscrawler-es7-2.7-SNAPSHOT]$ vim test_dir_45.79.189.33/index_45.79.189.33/_settings.yaml
[ls@li1288-33 fscrawler-es7-2.7-SNAPSHOT]$ bin/fscrawler --config_dir test_dir_45.79.189.33 index_45.79.189.33 --loop 1 --restart
03:28:57,555 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [61.2mb/843mb=7.27%], RAM [1.8gb/3.7gb=49.25%], Swap [511.9mb/511.9mb=100.0%].
03:28:58,727 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.5.1
03:28:58,841 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
03:28:59,209 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [index_45.79.189.33] for [/mnt/sp/fsSharepointFiles] every [15m]
03:28:59,791 WARN  [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

03:29:00,745 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler is stopping after 1 run
03:29:00,914 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [index_45.79.189.33] stopped
```

Could you please tell me why re-indexing is not working properly?

-Lisa

Weird.
Is there anything in the Elasticsearch logs?

Could you try decreasing the bulk size to 10 to see if things get better?
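For example, in the elasticsearch section of _settings.yaml (the same settings as above, with only bulk_size changed):

```
elasticsearch:
  nodes:
  - url: "http://50.116.48.89:8881"
  username: ls
  password: lspass
  bulk_size: 10
  flush_interval: "5s"
  byte_size: "10mb"
```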

Hi David,

I tried with bulk_size = 5, but the same thing is happening:

```
[ls@li1288-33 fscrawler-es7-2.7-SNAPSHOT]$ bin/fscrawler --config_dir --debug test_dir_45.79.189.33 index_45.79.189.33 --loop 1 --restart
03:56:18,992 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [61.7mb/843mb=7.32%], RAM [1.8gb/3.7gb=49.35%], Swap [511.9mb/511.9mb=100.0%].
03:56:20,127 WARN  [f.p.e.c.f.c.v.ElasticsearchClientV7] failed to create elasticsearch client, disabling crawler...
03:56:20,127 FATAL [f.p.e.c.f.c.FsCrawlerCli] We can not start Elasticsearch Client. Exiting.
java.net.ConnectException: Connection refused
        at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:823) ~[elasticsearch-rest-client-7.6.2.jar:7.6.2]
        at org.elasticsearch.client.RestClient.performRequest(RestClient.java:248) ~[elasticsearch-rest-client-7.6.2.jar:7.6.2]
        at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235) ~[elasticsearch-rest-client-7.6.2.jar:7.6.2]
        at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1514) ~[elasticsearch-rest-high-level-client-7.6.2.jar:7.6.2]
        at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1499) ~[elasticsearch-rest-high-level-client-7.6.2.jar:7.6.2]
        at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1466) ~[elasticsearch-rest-high-level-client-7.6.2.jar:7.6.2]
        at org.elasticsearch.client.RestHighLevelClient.info(RestHighLevelClient.java:730) ~[elasticsearch-rest-high-level-client-7.6.2.jar:7.6.2]
        at fr.pilato.elasticsearch.crawler.fs.client.v7.ElasticsearchClientV7.getVersion(ElasticsearchClientV7.java:169) ~[fscrawler-elasticsearch-client-v7-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.client.ElasticsearchClient.checkVersion(ElasticsearchClient.java:181) ~[fscrawler-elasticsearch-client-base-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.client.v7.ElasticsearchClientV7.start(ElasticsearchClientV7.java:142) ~[fscrawler-elasticsearch-client-v7-2.7-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.cli.FsCrawlerCli.main(FsCrawlerCli.java:257) [fscrawler-cli-2.7-SNAPSHOT.jar:?]
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_242]
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714) ~[?:1.8.0_242]
        at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvent(DefaultConnectingIOReactor.java:174) ~[httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:148) ~[httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:351) ~[httpcore-nio-4.4.13.jar:4.4.13]
        at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:221) ~[httpasyncclient-4.1.4.jar:4.1.4]
        at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64) ~[httpasyncclient-4.1.4.jar:4.1.4]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_242]
03:56:20,134 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [test_dir_45.79.189.33] stopped
03:56:20,136 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [test_dir_45.79.189.33] stopped
```

I am not sure about my Elasticsearch logs. Are they supposed to be in /var/log/elasticsearch?
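(If in doubt, the log directory can be confirmed from the server's own config; a quick check, assuming a default package install under /etc/elasticsearch:)

```
# path.logs in elasticsearch.yml sets where the server writes its logs;
# RPM/DEB installs default to /var/log/elasticsearch
grep path.logs /etc/elasticsearch/elasticsearch.yml
```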
Here is what I found in /var/log/elasticsearch/elasticsearch.log:

```
[2020-04-16T01:30:00,001][INFO ][o.e.x.m.MlDailyMaintenanceService] [li393-89.members.linode.com] triggering scheduled [ML] maintenance tasks
[2020-04-16T01:30:00,011][INFO ][o.e.x.m.a.TransportDeleteExpiredDataAction] [li393-89.members.linode.com] Deleting expired data
[2020-04-16T01:30:00,013][INFO ][o.e.x.s.SnapshotRetentionTask] [li393-89.members.linode.com] starting SLM retention snapshot cleanup task
[2020-04-16T01:30:00,019][INFO ][o.e.x.m.a.TransportDeleteExpiredDataAction] [li393-89.members.linode.com] Completed deletion of expired ML data
[2020-04-16T01:30:00,020][INFO ][o.e.x.m.MlDailyMaintenanceService] [li393-89.members.linode.com] Successfully completed [ML] maintenance tasks
[2020-04-16T01:56:11,290][INFO ][o.e.c.m.MetaDataCreateIndexService] [li393-89.members.linode.com] [index_45.79.189.33] creating index, cause [api], templates [], shards [1]/[1], mappings [_doc]
[2020-04-16T01:56:11,454][INFO ][o.e.c.m.MetaDataCreateIndexService] [li393-89.members.linode.com] [index_45.79.189.33_folder] creating index, cause [api], templates [], shards [1]/[1], mappings [_doc]
[2020-04-16T03:01:34,106][INFO ][o.e.c.m.MetaDataDeleteIndexService] [li393-89.members.linode.com] [index_45.79.189.33/4dnt8QBwRZW0-Varnd_vlA] deleting index
[2020-04-16T03:01:53,659][INFO ][o.e.c.m.MetaDataCreateIndexService] [li393-89.members.linode.com] [index_45.79.189.33] creating index, cause [api], templates [], shards [1]/[1], mappings [_doc]
[2020-04-16T03:28:25,262][INFO ][o.e.c.m.MetaDataMappingService] [li393-89.members.linode.com] [index_45.79.189.33/vQqoPbJ0QPW75O9zYiy2PA] update_mapping [_doc]
[2020-04-16T03:41:36,694][INFO ][o.e.c.m.MetaDataDeleteIndexService] [li393-89.members.linode.com] [index_45.79.189.33/vQqoPbJ0QPW75O9zYiy2PA] deleting index
[2020-04-16T03:42:03,165][INFO ][o.e.c.m.MetaDataCreateIndexService] [li393-89.members.linode.com] [index_45.79.189.33] creating index, cause [api], templates [], shards [1]/[1], mappings [_doc]
```

That's not the same error. Now FSCrawler cannot start at all, as it can't connect to the Elasticsearch server. Is Elasticsearch still running?
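A quick way to check from the machine running FSCrawler (a sketch using the host, port, and credentials from your settings; adjust the service name if Elasticsearch is not installed as a systemd service):

```
# On the Elasticsearch host: is the service up?
sudo systemctl status elasticsearch

# From the FSCrawler machine: can we reach the node at all?
curl -u ls:lspass http://50.116.48.89:8881
```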

Hi David,

I checked the Elasticsearch service; it's running!
The server where I am running FSCrawler is connected through a VPN to the SharePoint VM. Sometimes when I ping my Elasticsearch server it is not reachable, which I think is due to the VPN. So I just stopped the VPN while indexing, and that worked!!
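In case it helps someone else, this is the kind of pre-flight check I now run before starting the crawler (hosts and paths are the ones from this thread):

```
# Confirm the Elasticsearch host is reachable (it was intermittently
# unreachable while the VPN was up)
ping -c 3 50.116.48.89

# Confirm the SharePoint mount still answers before crawling
stat /mnt/sp/fsSharepointFiles
```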

Thanks,
Lisa
