Pointing FSCrawler to a separate server for documents

Hi all,

I have FSCrawler working on a DEV box where the documents are located on the same server as FSCrawler and Elasticsearch. In the _settings.json file I just set the url to be my document location, in the form "D:\MyDocs".
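For reference, the fs part of my DEV _settings.json looks roughly like this (trimmed down to the relevant setting; note that the backslash has to be escaped in the actual JSON):

"fs" : {
  "url" : "D:\\MyDocs"
}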

Now I'm moving Elasticsearch and FSCrawler onto a new server and placing the documents on a separate server. How should I format the value for url in my _settings.json file now?

I'm using FSCrawler 2.4

How do you normally access the documents from the server where FSCrawler is running?

Is it a mapped drive, like Z:?

It's a new dedicated server; we have never accessed the documents from here. I don't really want to map a network drive from the server where FSCrawler will be installed to the server where the docs are stored. Our setup will be a three-server solution: server 1 is our web server, server 2 our document server, and server 3 our search server.

So you will run FSCrawler "locally" from server 2, right?

Sadly, I think you will need to reindex everything, as I believe I'm using the full path to generate a unique id for folders and files:

doc.getPath().setRoot(SignTool.sign(dirname));

This id is used, IIRC, to check whether folders have been removed. Since the id comes from the full path, a new url would give every folder and file a new id, which is why a fresh index is needed.

Yes, I'm setting up from scratch, so the question is what I put in the url to gain access to the drive on the other server. I'm installing FSCrawler on the Elasticsearch server, so it's going on server 3.

All servers are Windows Server 2012 R2.

Hi dadoonet,

I've been given the green light to map the drive as a network drive. I've therefore done that (mapped as the E: drive), so in my settings file I have set the url to "E:\\". However, now when I try to run FSCrawler, I receive a fatal error. Using --debug on the command, it says: failed to create elasticsearch client. Elasticsearch is up and running, and Kibana is reaching it fine.

Maybe run it with the --debug option and paste the logs here? (Formatted, please.)

Here's the debug output:

11:02:06,863 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [2/_settings.json] already exists
11:02:06,863 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [2/_settings_folder.json] already exists
11:02:06,863 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [5/_settings.json] already exists
11:02:06,863 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [5/_settings_folder.json] already exists
11:02:06,863 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [6/_settings.json] already exists
11:02:06,863 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
11:02:06,879 DEBUG [f.p.e.c.f.FsCrawler] Starting job [bamdocs]...
11:02:09,511 WARN  [f.p.e.c.f.c.ElasticsearchClientManager] failed to create elasticsearch client, disabling crawler...
11:02:09,511 FATAL [f.p.e.c.f.FsCrawler] Fatal error received while running the crawler: [null]
11:02:09,511 DEBUG [f.p.e.c.f.FsCrawler] error caught
java.net.ConnectException: null
	at org.elasticsearch.client.http.nio.pool.RouteSpecificPool.timeout(RouteSpecificPool.java:168) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
	at org.elasticsearch.client.http.nio.pool.AbstractNIOConnPool.requestTimeout(AbstractNIOConnPool.java:561) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
	at org.elasticsearch.client.http.nio.pool.AbstractNIOConnPool$InternalSessionRequestCallback.timeout(AbstractNIOConnPool.java:822) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
	at org.elasticsearch.client.http.impl.nio.reactor.SessionRequestImpl.timeout(SessionRequestImpl.java:183) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
	at org.elasticsearch.client.http.impl.nio.reactor.DefaultConnectingIOReactor.processTimeouts(DefaultConnectingIOReactor.java:210) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
	at org.elasticsearch.client.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:155) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
	at org.elasticsearch.client.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:348) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
	at org.elasticsearch.client.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:192) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
	at org.elasticsearch.client.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64) ~[elasticsearch-rest-client-6.0.0-beta1.jar:6.0.0-beta1]
	at java.lang.Thread.run(Unknown Source) ~[?:1.8.0_152]
11:02:09,526 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [bamdocs]
11:02:09,526 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler Rest service stopped
11:02:09,526 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] Closing Elasticsearch client manager
11:02:09,526 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing REST client
11:02:09,526 DEBUG [f.p.e.c.f.c.ElasticsearchClient] REST client closed
11:02:09,526 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
11:02:09,526 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [bamdocs] stopped

Also, here is my _settings file:

{
  "name" : "bamdocs",
  "fs" : {
    "url" : "E:\\",
    "update_rate" : "15m",
    "excludes" : [ "~*" ],
    "json_support" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : true,
    "add_as_inner_object" : false,
    "store_source" : false,
    "index_content" : true,
    "attributes_support" : false,
    "raw_metadata" : true,
    "xml_support" : false,
    "index_folders" : true,
    "lang_detect" : false,
    "continue_on_error" : false,
    "pdf_ocr" : true,
    "ocr" : {
      "language" : "eng"
    }
  },
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "10.128.128.16",
      "port" : 9200,
      "scheme" : "HTTP"
    } ],
    "index" : "bamindex",
    "bulk_size" : 100,
    "flush_interval" : "5s",
    "username" : "elastic",
    "password" : "elastic123"
  },
  "rest" : {
    "scheme" : "HTTP",
    "host" : "10.128.128.16",
    "port" : 8080,
    "endpoint" : "fscrawler"
  }
}

Hi dadoonet,

I realised that in the settings file I had the rest service's IP address set incorrectly (a typo). After correcting this it gets further, but now it says that E:\ doesn't exist.

Here's the log:

11:58:05,843 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [2/_settings.json] already exists
11:58:05,858 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [2/_settings_folder.json] already exists
11:58:05,858 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [5/_settings.json] already exists
11:58:05,858 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [5/_settings_folder.json] already exists
11:58:05,858 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [6/_settings.json] already exists
11:58:05,858 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
11:58:05,858 DEBUG [f.p.e.c.f.FsCrawler] Starting job [bamdocs]...
11:58:08,166 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Using elasticsearch >= 5, so we can use ingest node feature
11:58:08,651 WARN  [f.p.e.c.f.FsCrawler] We found old configuration index settings in [C:\Program Files\Elastic\FsCrawler\Jobs] or [C:\Program Files\Elastic\FsCrawler\Jobs\bamdocs\_mappings]. You should look at the documentation about upgrades: https://github.com/dadoonet/fscrawler#upgrade-to-23
11:58:08,651 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
11:58:08,651 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] FS crawler connected to an elasticsearch [5.6.3] node.
11:58:08,651 DEBUG [f.p.e.c.f.c.ElasticsearchClient] create index [bamindex]
11:58:08,682 DEBUG [f.p.e.c.f.c.ElasticsearchClient] create index [bamdocs_folder]
11:58:08,697 DEBUG [f.p.e.c.f.FsCrawlerImpl] creating fs crawler thread [bamdocs] for [E:\] every [15m]
11:58:08,697 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started for [bamdocs] for [E:\] every [15m]
11:58:08,697 DEBUG [f.p.e.c.f.FsCrawlerImpl] Fs crawler thread [bamdocs] is now running. Run #1...
11:58:08,713 WARN  [f.p.e.c.f.FsCrawlerImpl] Error while crawling E:\: E:\ doesn't exists.
11:58:08,713 WARN  [f.p.e.c.f.FsCrawlerImpl] Full stacktrace
java.lang.RuntimeException: E:\ doesn't exists.
	at fr.pilato.elasticsearch.crawler.fs.FsCrawlerImpl$FSParser.run(FsCrawlerImpl.java:325) [fscrawler-2.4.jar:?]
	at java.lang.Thread.run(Unknown Source) [?:1.8.0_152]
11:58:08,713 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler is stopping after 1 run
11:58:09,492 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [bamdocs]
11:58:09,492 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
11:58:09,492 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler Rest service stopped
11:58:09,492 DEBUG [f.p.e.c.f.c.ElasticsearchClientManager] Closing Elasticsearch client manager
11:58:09,492 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing REST client
11:58:09,492 DEBUG [f.p.e.c.f.c.ElasticsearchClient] REST client closed
11:58:09,492 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
11:58:09,508 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [bamdocs] stopped

Finally working. It didn't like the url pointing to the mapped drive. Instead I had to set it directly against the server, in the form:

\\servername\driveletter$\foldername
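So the url entry in my _settings.json now looks something like this (servername and foldername are placeholders, and every backslash in the UNC path has to be doubled in the JSON):

"fs" : {
  "url" : "\\\\servername\\driveletter$\\foldername"
}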

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.