Pointing FSCrawler to a separate server for documents

Hi all,

I have FSCrawler working on a DEV box where the documents are located on the same server as FSCrawler and Elasticsearch. In the _settings.json file I just set the url to be my document location, in the form "D:\MyDocs".
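For reference, the fs part of my DEV _settings.json looks roughly like this (trimmed down to the relevant setting; note that the backslash has to be escaped in the actual JSON):

"fs" : {
  "url" : "D:\\MyDocs"
}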

Now I'm moving Elasticsearch and FSCrawler onto a new server and placing the documents on a separate server. How should I format the value for url in my _settings.json file now?

I'm using FSCrawler 2.4

How do you normally access the documents from the server where FSCrawler is running?

Is it a mapped drive, like Z:?

It's a new dedicated server; we have never accessed the documents from here. I don't really want to map a network drive from the server where FSCrawler will be installed to the server where the docs are stored. Our setup will be a three-server solution: server 1 is our web server, server 2 our document server, and server 3 our search server.

So you will run FSCrawler "locally" from server 2, right?

Sadly, I think you will need to reindex everything, as I believe I'm using the full path to generate a unique id for folders and files:

doc.getPath().setRoot(SignTool.sign(dirname));

This id is used, IIRC, to check whether folders have been removed. Since the id comes from the full path, a new url would give every folder and file a new id, which is why a fresh index is needed.

Yes, I'm setting up from scratch, so the question is what I put in the url to gain access to the drive on the other server. I'm installing FSCrawler on the Elasticsearch server, so it's going on server 3.

All servers are Windows Server 2012 R2.

Hi dadoonet,

I've been given the green light to map the drive as a network drive. I've therefore done that (mapped as the E: drive), so in my settings file I have set the url to "E:\\". However, now when I try to run FSCrawler, I receive a fatal error. Using --debug on the command, it says: failed to create elasticsearch client. Elasticsearch is up and running, and Kibana is reaching it fine.

Maybe run it with the --debug option and paste the logs here? (Formatted, please.)

Here's the debug output:

11:02:06,863 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [2/_settings.json] already exists
11:02:06,863 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [2/_settings_folder.json] already exists
11:02:06,863 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [5/_settings.json] already exists
11:02:06,863 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [5/_settings_folder.json] already exists
11:02:06,863 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [6/_settings.json] already exists
11:02:06,863 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
11:02:06,879 DEBUG [f.p.e.c.f.FsCrawler] Starting job [bamdocs]...
11:02:09,511 WARN  [f.p.e.c.f.c.ElasticsearchClientManager] failed to create elasticsearch client, disabling crawler...
11:02:09,511 FATAL [f.p.e.c.f.FsCrawler] Fatal error received while running the crawler: [null]
11:02:09,511 DEBUG [f.p.e.c.f.FsCrawler] error caught
java.net.ConnectException: null
	at org.elasticsearch.client.http.nio.pool.RouteSpecificPool.timeout(RouteSpecificPool.java:168) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
	at org.elasticsearch.client.http.nio.pool.AbstractNIOConnPool.requestTimeout(AbstractNIOConnPool.java:561) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
	at org.elasticsearch.client.http.nio.pool.AbstractNIOConnPool$InternalSessionRequestCallback.timeout(AbstractNIOConnPool.java:822) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
	at org.elasticsearch.client.http.impl.nio.reactor.SessionRequestImpl.timeout(SessionRequestImpl.java:183) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
	at org.elasticsearch.client.http.impl.nio.reactor.DefaultConnectingIOReactor.processTimeouts(DefaultConnectingIOReactor.java:210) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
	at org.elasticsearch.client.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:155) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
	at org.elasticsearch.client.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:348) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
	at org.elasticsearch.client.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:192) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
	at org.elasticsearch.client.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
	at java.lang.Thread.run(Unknown Source) ~[?:1.8.0_152]
11:02:09,526 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [bamdocs]
11:02:09,526 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler Rest service stopped
11:02:09,526 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] Closing Elasticsearch client manager
11:02:09,526 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing REST client
11:02:09,526 DEBUG [f.p.e.c.f.c.ElasticsearchClient] REST client closed
11:02:09,526 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
11:02:09,526 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [bamdocs] stopped

Also, here is my _settings file:

{
  "name" : "bamdocs",
  "fs" : {
    "url" : "E:\\",
    "update_rate" : "15m",
    "excludes" : [ "~*" ],
    "json_support" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : true,
    "add_as_inner_object" : false,
    "store_source" : false,
    "index_content" : true,
    "attributes_support" : false,
    "raw_metadata" : true,
    "xml_support" : false,
    "index_folders" : true,
    "lang_detect" : false,
    "continue_on_error" : false,
    "pdf_ocr" : true,
    "ocr" : {
      "language" : "eng"
    }
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "10.128.128.16",
      "port" : 9200,
      "scheme" : "HTTP"
    } ],
    "index" : "bamindex",
    "bulk_size" : 100,
    "flush_interval" : "5s",
    "username" : "elastic",
    "password" : "elastic123"
  },
  "rest" : {
    "scheme" : "HTTP",
    "host" : "10.128.128.16",
    "port" : 8080,
    "endpoint" : "fscrawler"
  }
}

Hi dadoonet,

I realised that in the settings file I had the rest service's IP address set incorrectly (a typo). After correcting this it gets further, but now it says that E:\ doesn't exist.

Here's the log:

11:58:05,843 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [2/_settings.json] already exists
11:58:05,858 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [2/_settings_folder.json] already exists
11:58:05,858 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [5/_settings.json] already exists
11:58:05,858 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [5/_settings_folder.json] already exists
11:58:05,858 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [6/_settings.json] already exists
11:58:05,858 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
11:58:05,858 DEBUG [f.p.e.c.f.FsCrawler] Starting job [bamdocs]...
11:58:08,166 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Using elasticsearch >= 5, so we can use ingest node feature
11:58:08,651 WARN  [f.p.e.c.f.FsCrawler] We found old configuration index settings in [C:\Program Files\Elastic\FsCrawler\Jobs] or [C:\Program Files\Elastic\FsCrawler\Jobs\bamdocs\_mappings]. You should look at the documentation about upgrades: https://github.com/dadoonet/fscrawler#upgrade-to-23
11:58:08,651 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
11:58:08,651 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] FS crawler connected to an elasticsearch [5.6.3] node.
11:58:08,651 DEBUG [f.p.e.c.f.c.ElasticsearchClient] create index [bamindex]
11:58:08,682 DEBUG [f.p.e.c.f.c.ElasticsearchClient] create index [bamdocs_folder]
11:58:08,697 DEBUG [f.p.e.c.f.FsCrawlerImpl] creating fs crawler thread [bamdocs] for [E:\] every [15m]
11:58:08,697 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started for [bamdocs] for [E:\] every [15m]
11:58:08,697 DEBUG [f.p.e.c.f.FsCrawlerImpl] Fs crawler thread [bamdocs] is now running. Run #1...
11:58:08,713 WARN  [f.p.e.c.f.FsCrawlerImpl] Error while crawling E:\: E:\ doesn't exists.
11:58:08,713 WARN  [f.p.e.c.f.FsCrawlerImpl] Full stacktrace
java.lang.RuntimeException: E:\ doesn't exists.
	at fr.pilato.elasticsearch.crawler.fs.FsCrawlerImpl$FSParser.run(FsCrawlerImpl.java:325) [fscrawler-2.4.jar:?]
	at java.lang.Thread.run(Unknown Source) [?:1.8.0_152]
11:58:08,713 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler is stopping after 1 run
11:58:09,492 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [bamdocs]
11:58:09,492 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
11:58:09,492 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler Rest service stopped
11:58:09,492 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] Closing Elasticsearch client manager
11:58:09,492 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing REST client
11:58:09,492 DEBUG [f.p.e.c.f.c.ElasticsearchClient] REST client closed
11:58:09,492 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
11:58:09,508 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [bamdocs] stopped

Finally working. It didn't like the url pointing to the mapped drive. Instead I had to set it directly against the server, in the form:

\\servername\driveletter$\foldername
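So the url entry in my _settings.json now looks something like this (servername and foldername are placeholders, and every backslash in the UNC path has to be doubled in the JSON):

"fs" : {
  "url" : "\\\\servername\\driveletter$\\foldername"
}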

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.