FSCrawler with an ES cluster

Hello,

I have configured an ES cluster with 3 nodes, and I want to index files using FSCrawler.
Is there any setting I need to use to list all 3 nodes in the FSCrawler config file?

Thanks,
Priyanka

Hey

Have a look here:

https://fscrawler.readthedocs.io/en/latest/admin/fs/elasticsearch.html#node-settings

There's an example of this.
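In short, you can list all the nodes under elasticsearch.nodes. A minimal sketch (the URLs here are placeholders, use your own):

elasticsearch:
  nodes:
  - url: "http://node1:9200"
  - url: "http://node2:9200"
  - url: "http://node3:9200"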

Hello @dadoonet,

Thanks for your reply!

One more question: I want to index files that are located on another server.
That means my indexing job will run on one server, but the files to be indexed live on another.
Is there an option for such a setup, or can we put the remote path directly in url?

Regards,
Priyanka

To index remote files you can:

  • run FSCrawler on the remote machine (indexing local files)
  • mount the remote dir on the machine where FSCrawler is running (see the sketch below)
  • use ssh to index remotely

The first solution is the best IMO.
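For the second option on Windows, a minimal sketch would be to map the remote share to a drive letter and point the job at it (the server and share names here are made up):

net use Z: \\remoteserver\shared_docs

and then in the job settings:

fs:
  url: "Z:\\"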

Hello @dadoonet,

I am trying to use SSH, but it is not working for me.

My config file:

---
name: "remote"
fs:
  url: "C:\\tmp\\priyanka"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
server:
  hostname: "Myremoteserver.com"
  username: "myusername"
  password: "Mypassword"
  port: 22
  protocol: "ssh"
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"

Regards,
Priyanka

Could you share FSCrawler logs?
Start with the --debug option.
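For example, assuming your job is named remote:

fscrawler remote --debug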

Hello @dadoonet,

I have already created the FSCrawler job and updated the _settings.yaml file.
When I try to run that same job, it says the job does not exist and asks whether I want to create it again.

fscrawler remote --debug
09:17:35,051 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [2/_settings.json] already exists
09:17:35,055 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [2/_settings_folder.json] already exists
09:17:35,055 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [5/_settings.json] already exists
09:17:35,055 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [5/_settings_folder.json] already exists
09:17:35,055 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
09:17:35,055 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
09:17:35,055 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings.json] already exists
09:17:35,055 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings_folder.json] already exists
09:17:35,055 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [remote]...
09:17:35,347 WARN  [f.p.e.c.f.c.FsCrawlerCli] job [remote] does not exist
09:17:35,347 INFO  [f.p.e.c.f.c.FsCrawlerCli] Do you want to create it (Y/N)?

Regards,
Priyanka

Which FSCrawler version are you using? I guess it's 2.7?
What do you have in your user home directory, under the .fscrawler dir?
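On Windows you can list it with, for example:

dir %USERPROFILE%\.fscrawler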

Hello @dadoonet,

I am using fscrawler-es7 version 2.7.
In the .fscrawler dir, I have 3 folders:

  1. _default
  2. new_attachment (where I have indexed my local files)
  3. remote (where I am trying to index files remotely)

Regards,
Priyanka

What do you have in the remote dir? And what is the content of the files?

Hello @dadoonet,

In the remote dir, I have the _settings.yaml file. I have already shared its content in my update above.

Regards,
Priyanka

Which exact version did you download? Which file, I mean? What is its date?

Hello @dadoonet,

I have downloaded version es7-2.7.
The file I downloaded is fscrawler-es7-2.7-20190625.065648-37.zip.

Regards,
Priyanka

Ok. It's a very old one.

Could you use this one? https://oss.sonatype.org/content/repositories/snapshots/fr/pilato/elasticsearch/crawler/fscrawler-es7/2.7-SNAPSHOT/fscrawler-es7-2.7-20200122.065827-76.zip

Hello @dadoonet,

I have downloaded the version you mentioned.
In the config file, should the url be the path on the remote server from which we want to index the files?

name: "test"
fs:
  url: "/path/to/data/dir/on/server"
server:
  hostname: "mynode.mydomain.com"
  port: 22
  username: "username"
  password: "password"
  protocol: "ssh"

Regards,
Priyanka

This is what will happen behind the scenes:

ssh mynode.mydomain.com
cd /path/to/data/dir/on/server
ls
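You can test the same thing by hand from the machine where FSCrawler runs; if this fails there, FSCrawler will fail too:

ssh username@mynode.mydomain.com "ls /path/to/data/dir/on/server"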

Hello @dadoonet,

I am getting an error:

05:50:08,288 WARN [f.p.e.c.f.FsParserAbstract] Error while crawling /path/to/data/dir/on/server: /path/to/data/dir/on/server doesn't exists.

Also, I am not able to ssh from cmd on my server:

'ssh' is not recognized as an internal or external command,
operable program or batch file.

I am also not able to telnet to port 22.

Regards,
Priyanka

Most likely there is no SSH service running on the machine you want to index.
So it cannot work.
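You can check whether port 22 is reachable, for example from PowerShell on your Windows machine (hostname taken from your settings above):

Test-NetConnection Myremoteserver.com -Port 22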

Hello @dadoonet,

Yes, I need to open the port as well.

Regards,
Priyanka

Yes.