FSCrawler for ES clustering

Hello @dadoonet,

I have opened port 22 from the source ES servers to the destination remote server.
An SSH client is installed on the remote server.

But I am still getting the same path does not exist error when trying to run the FSCrawler job.
Do we have to install an SSH client on the ES servers as well?

Regards,
Priyanka

Do you have an SSH server running on the machine which has the files you want to index?

Hello @dadoonet,

Yes, I have an SSH server running on the machine which has the files I want to index, that is, on the remote server.

But NOT on the ES server, where I am running the FSCrawler job.

Regards,
Priyanka

From the machine where FSCrawler is running, can you run:

ssh mynode.mydomain.com
cd /path/to/data/dir/on/server
ls

Hello @dadoonet,

I am getting the error below:

'ssh' is not recognized as an internal or external command,
operable program or batch file.

Regards,
Priyanka

But you told me that:

I can't help if you can't test the connection.

Hello @dadoonet,

I was trying this on the ES server.
Should mynode.mydomain.com be the ES server or the remote server? I am confused.

Kindly help!!!

Regards,
Priyanka

FSCrawler runs on machineA. In the FSCrawler settings, you are saying that you want to use SSH to connect to machineB and crawl files under /path/to/files.

So, you need to test from machineA:

ssh machineB
cd /path/to/files
ls
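
If no ssh client is at hand on machineA, a rough first check is whether machineB accepts TCP connections on the SSH port at all. The following is only a minimal Python sketch of that check, not something FSCrawler provides; the helper name can_connect is made up for illustration. A successful connection shows that something is listening on port 22, not that the login or the path will work.

```python
import socket


def can_connect(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host name and opens the socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example call, run from machineA ("machineB" is the placeholder used in this thread):
# print(can_connect("machineB", 22))
```

If this returns False, the problem is likely at the network or firewall level rather than in the FSCrawler settings.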

Hello @dadoonet,

SSH is working on machineA. I am able to navigate to "/path/to/files", but ls is not working after cd.

Regards,
Priyanka

Could you share the full details? All the commands you ran (except passwords) and all the output.

MachineA is where FSCrawler runs, and machineB is the machine I want to connect to over SSH to crawl files under /path/to/files.

From machineA I tried to ssh to machineB as suggested.

First command:
SSH machineB.com

Then I tried to navigate to the mentioned folder:
2nd command:
cd \\path\\to\\files

I am able to navigate to the mentioned folder.

Then I tried the following command:
3rd command:
ls

but it gives me the message: 'ls' is not recognized as an internal or external command, operable program or batch file.

And my config file looks like:

name: "remote"
fs:
  url: "path\\to\\files"
  server:
  hostname: "MachineB.com"
  port: 22
  username: "usrename"
  password: "pwd"
  protocol: "ssh"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "masternode.com"
  - url: "datanode1.com"
  - url: "datanode2.com"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb" 

Regards,
Priyanka

Is that a Windows server? Could you run dir instead?

Could you also try one of these:

url: "\\path\\to\\files"
url: "/path/to/files"

Hello @dadoonet,

dir worked fine for me. It shows the list of all the files present in that folder.

I have also tried changing url: "\path\to\files" to url: "/path/to/files", but I am getting the same error:

04:23:51,178 WARN [f.p.e.c.f.FsParserAbstract] Error while crawling /path/to/files: /path/to/files doesn't exists.

Regards,
Priyanka

Can you run with the --trace option and share the full logs?

Hello @dadoonet,

Please find the --trace logs below:

05:02:34,382 DEBUG [f.p.e.c.f.c.ElasticsearchClientUtil] Trying to find a client version 7
05:02:34,382 TRACE [f.p.e.c.f.c.ElasticsearchClientUtil] Trying to find a class named [fr.pilato.elasticsearch.crawler.fs.client.v7.ElasticsearchClientV7]
05:02:34,382 TRACE [f.p.e.c.f.c.ElasticsearchClientUtil] Found [fr.pilato.elasticsearch.crawler.fs.client.v7.ElasticsearchClientV7] class as the elasticsearch client implementation.
05:02:35,836 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.4.2
05:02:35,898 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
05:02:35,898 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
05:02:35,914 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] FS crawler connected to an elasticsearch [7.4.2] node.
05:02:35,914 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] create index [remote]
05:02:35,914 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] index settings: [{
  "settings": {
    "number_of_shards": 1,
    "index.mapping.total_fields.limit": 2000,
    "analysis": {
      "analyzer": {
        "fscrawler_path": {
          "tokenizer": "fscrawler_path"
        }
      },
      "tokenizer": {
        "fscrawler_path": {
          "type": "path_hierarchy"
        }
      }
    }
  },
  "mappings": {
    "dynamic_templates": [
      {
        "raw_as_text": {
          "path_match": "meta.raw.*",
          "mapping": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    ],
    "properties": {
      "attachment": {
        "type": "binary",
        "doc_values": false
      },
      "attributes": {
        "properties": {
          "group": {
            "type": "keyword"
          },
          "owner": {
            "type": "keyword"
          }
        }
      },
      "content": {
        "type": "text"
      },
      "file": {
        "properties": {
          "content_type": {
            "type": "keyword"
          },
          "filename": {
            "type": "keyword",
            "store": true
          },
          "extension": {
            "type": "keyword"
          },
          "filesize": {
            "type": "long"
          },
          "indexed_chars": {
            "type": "long"
          },
          "indexing_date": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "created": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "last_modified": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "last_accessed": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "checksum": {
            "type": "keyword"
          },
          "url": {
            "type": "keyword",
            "index": false
          }
        }
      },
      "meta": {
        "properties": {
          "author": {
            "type": "text"
          },
          "date": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "keywords": {
            "type": "text"
          },
          "title": {
            "type": "text"
          },
          "language": {
            "type": "keyword"
          },
          "format": {
            "type": "text"
          },
          "identifier": {
            "type": "text"
          },
          "contributor": {
            "type": "text"
          },
          "coverage": {
            "type": "text"
          },
          "modifier": {
            "type": "text"
          },
          "creator_tool": {
            "type": "keyword"
          },
          "publisher": {
            "type": "text"
          },
          "relation": {
            "type": "text"
          },
          "rights": {
            "type": "text"
          },
          "source": {
            "type": "text"
          },
          "type": {
            "type": "text"
          },
          "description": {
            "type": "text"
          },
          "created": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "print_date": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "metadata_date": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "latitude": {
            "type": "text"
          },
          "longitude": {
            "type": "text"
          },
          "altitude": {
            "type": "text"
          },
          "rating": {
            "type": "byte"
          },
          "comments": {
            "type": "text"
          }
        }
      },
      "path": {
        "properties": {
          "real": {
            "type": "keyword",
            "fields": {
              "tree": {
                "type": "text",
                "analyzer": "fscrawler_path",
                "fielddata": true
              },
              "fulltext": {
                "type": "text"
              }
            }
          },
          "root": {
            "type": "keyword"
          },
          "virtual": {
            "type": "keyword",
            "fields": {
              "tree": {
                "type": "text",
                "analyzer": "fscrawler_path",
                "fielddata": true
              },
              "fulltext": {
                "type": "text"
              }
            }
          }
        }
      }
    }
  }
}
]
05:02:36,133 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] wait for yellow health on index [remote]
05:02:36,164 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] health response: {"cluster_name":"elasticsearch","status":"green","timed_out":false,"number_of_nodes":2,"number_of_data_nodes":2,"active_primary_shards":1,"active_shards":2,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}
05:02:36,164 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] create index [remote_folder]
05:02:36,164 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] index settings: [{
  "settings": {
    "analysis": {
      "analyzer": {
        "fscrawler_path": {
          "tokenizer": "fscrawler_path"
        }
      },
      "tokenizer": {
        "fscrawler_path": {
          "type": "path_hierarchy"
        }
      }
    }
  },
  "mappings": {
    "properties" : {
      "real" : {
        "type" : "keyword",
        "store" : true
      },
      "root" : {
        "type" : "keyword",
        "store" : true
      },
      "virtual" : {
        "type" : "keyword",
        "store" : true
      }
    }
  }
}
]
05:02:36,289 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] wait for yellow health on index [remote_folder]
05:02:36,305 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] health response: {"cluster_name":"elasticsearch","status":"green","timed_out":false,"number_of_nodes":2,"number_of_data_nodes":2,"active_primary_shards":1,"active_shards":2,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}
05:02:36,305 DEBUG [f.p.e.c.f.FsParserAbstract] creating fs crawler thread [remote] for [\\path\\to\\files] every [15m]
05:02:36,305 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [remote] for [\\path\\to\\files] every [15m]
05:02:36,305 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler thread [remote] is now running. Run #1...
05:02:36,305 WARN  [f.p.e.c.f.FsParserAbstract] Error while crawling \\path\\to\\files: \\path\\to\\files doesn't exists.
05:02:36,305 WARN  [f.p.e.c.f.FsParserAbstract] Full stacktrace java.lang.RuntimeException: \\path\\to\\files doesn't exists.
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:130) [fscrawler-core-2.7-SNAPSHOT.jar:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
05:02:36,305 DEBUG [f.p.e.c.f.FsParserAbstract] Fs crawler is going to sleep for 15m 

Regards,
Priyanka

Your FSCrawler config is incorrect. It should be something like:

name: "remote"
fs:
  url: "path\\to\\files"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
server:
  hostname: "MachineB.com"
  port: 22
  username: "usrename"
  password: "pwd"
  protocol: "ssh"
elasticsearch:
  nodes:
  - url: "masternode.com"
  - url: "datanode1.com"
  - url: "datanode2.com"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb" 
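
Compared with the earlier file, the difference is that server is now a top-level section next to fs and elasticsearch instead of sitting under fs. A minimal Python sketch of a check for that mistake follows. It assumes the settings file has already been parsed into a dictionary (for example with a YAML library); the function name misplaced_sections is hypothetical and not part of FSCrawler.

```python
# Sections that this thread's corrected config places at the top level.
TOP_LEVEL_SECTIONS = ("fs", "server", "elasticsearch")


def misplaced_sections(settings: dict) -> list:
    """List top-level section names that appear nested under another section."""
    problems = []
    for parent in TOP_LEVEL_SECTIONS:
        section = settings.get(parent)
        if not isinstance(section, dict):
            continue  # section missing or empty: nothing to inspect
        for child in TOP_LEVEL_SECTIONS:
            if child != parent and child in section:
                problems.append(f"{child} is nested under {parent}")
    return problems
```

Run against the first config in this thread, where server appears under fs, this kind of check would flag the nesting; against the corrected layout it reports nothing.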

Hello @dadoonet,

I have corrected the config file as you mentioned in the reply above.
But it is still not working for me. It is asking me to create the job again:

E:\new\fscrawler-es7-2.7-20200122.065827-76\fscrawler-es7-2.7-SNAPSHOT\bin>fscrawler remote --loop 1
04:28:21,192 INFO [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [752.4mb/12gb=6.12%], RAM [30.4gb/47.9gb=63.44%], Swap [38.7gb/54.9gb=70.44%].
04:28:21,474 WARN [f.p.e.c.f.c.FsCrawlerCli] job [remote] does not exist
04:28:21,474 INFO [f.p.e.c.f.c.FsCrawlerCli] Do you want to create it (Y/N)?

Regards,
Priyanka

Could you create it again?

Hello @dadoonet,

It is giving me another error now:

09:09:54,816 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [752.5mb/12gb=6.12%], RAM [30.4gb/47.9gb=63.44%], Swap [38.7gb/54.9gb=70.42%].
09:09:56,598 WARN  [f.p.e.c.f.c.v.ElasticsearchClientV7] failed to create elasticsearch client, disabling crawler...
09:09:56,598 FATAL [f.p.e.c.f.c.FsCrawlerCli] We can not start Elasticsearch Client. Exiting.
java.net.ConnectException: Connection refused: no further information
at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:823) ~[elasticsearch-rest-client-7.5.1.jar:7.5.1]
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:248) ~[elasticsearch-rest-client-7.5.1.jar:7.5.1]
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:235) ~[elasticsearch-rest-client-7.5.1.jar:7.5.1]
at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1514) ~[elasticsearch-rest-high-level-client-7.5.1.jar:7.5.1]
at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1499) ~[elasticsearch-rest-high-level-client-7.5.1.jar:7.5.1]
at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1466) ~[elasticsearch-rest-high-level-client-7.5.1.jar:7.5.1]
at org.elasticsearch.client.RestHighLevelClient.info(RestHighLevelClient.java:730) ~[elasticsearch-rest-high-level-client-7.5.1.jar:7.5.1]
at fr.pilato.elasticsearch.crawler.fs.client.v7.ElasticsearchClientV7.getVersion(ElasticsearchClientV7.java:169) ~[fscrawler-elasticsearch-client-v7-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.client.ElasticsearchClient.checkVersion(ElasticsearchClient.java:181) ~[fscrawler-elasticsearch-client-base-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.client.v7.ElasticsearchClientV7.start(ElasticsearchClientV7.java:142) ~[fscrawler-elasticsearch-client-v7-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.cli.FsCrawlerCli.main(FsCrawlerCli.java:257) [fscrawler-cli-2.7-SNAPSHOT.jar:?]
Caused by: java.net.ConnectException: Connection refused: no further information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:779) ~[?:?]
at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvent(DefaultConnectingIOReactor.java:174) ~[httpcore-nio-4.4.12.jar:4.4.12]
at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:148) ~[httpcore-nio-4.4.12.jar:4.4.12]
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:351) ~[httpcore-nio-4.4.12.jar:4.4.12]
at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:221) ~[httpasyncclient-4.1.4.jar:4.1.4]
at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64) ~[httpasyncclient-4.1.4.jar:4.1.4]
at java.lang.Thread.run(Thread.java:834) ~[?:?]
09:09:56,598 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [remote] stopped
09:09:56,613 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [remote] stopped 

Regards,
Priyanka
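
"Connection refused" in the trace above means the TCP connection to an Elasticsearch node was rejected before any HTTP request was made. One way to narrow this down is to test each entry under elasticsearch.nodes from the machine running FSCrawler. The sketch below is a rough Python illustration, not FSCrawler code: the function names are hypothetical, and falling back to port 9200 (Elasticsearch's default HTTP port) when a URL carries no port is an assumption of this sketch, not documented FSCrawler behavior.

```python
import socket
from urllib.parse import urlsplit


def node_endpoint(url: str, default_port: int = 9200) -> tuple:
    """Split a node URL into (host, port), adding a scheme if none is given."""
    if "://" not in url:
        url = "http://" + url  # assumption: plain HTTP when no scheme is present
    parts = urlsplit(url)
    return parts.hostname, parts.port or default_port


def refused_nodes(urls, timeout: float = 5.0) -> list:
    """Return the node URLs whose host:port does not accept a TCP connection."""
    failed = []
    for url in urls:
        host, port = node_endpoint(url)
        try:
            with socket.create_connection((host, port), timeout=timeout):
                pass  # connection opened, so the node is reachable
        except OSError:
            failed.append(url)
    return failed


# Example call with the placeholder node names from the config in this thread:
# print(refused_nodes(["masternode.com", "datanode1.com", "datanode2.com"]))
```

Any URL returned by refused_nodes points at a node, port, or firewall rule worth checking before looking further at the crawler settings.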