FSCrawler with an ES cluster

Hello,

I have configured an ES cluster with 3 nodes, and I want to index files using FSCrawler.
Is there any setting I need to use to list all 3 nodes in the FSCrawler config file?

Thanks,
Priyanka

Hey

Have a look here:

https://fscrawler.readthedocs.io/en/latest/admin/fs/elasticsearch.html#node-settings

There's an example of this.
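In short, you can list all the nodes under elasticsearch.nodes. A minimal sketch (the URLs here are placeholders, use your own):

elasticsearch:
  nodes:
  - url: "http://node1:9200"
  - url: "http://node2:9200"
  - url: "http://node3:9200"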

Hello @dadoonet,

Thanks for your reply!

One more question: I want to index files that are located on another server.
That means my indexing job will run on one server, but the files to be indexed live on another.
Is there an option for such a setup, or can we put the remote path directly in url?

Regards,
Priyanka

To index remote files you can:

  • run FSCrawler on the remote machine (indexing local files)
  • mount the remote dir on the machine where FSCrawler is running (see the sketch below)
  • use ssh to index remotely

The first solution is the best IMO.
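For the second option on Windows, a minimal sketch would be to map the remote share to a drive letter and point the job at it (the server and share names here are made up):

net use Z: \\remoteserver\shared_docs

and then in the job settings:

fs:
  url: "Z:\\"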

Hello @dadoonet,

I am trying to use SSH, but it is not working for me.

My config file:

---
name: "remote"
fs:
  url: "C:\\tmp\\priyanka"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
server:
  hostname: "Myremoteserver.com"
  username: "myusername"
  password: "Mypassword"
  port: 22
  protocol: "ssh"
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"

Regards,
Priyanka

Could you share FSCrawler logs?
Start with the --debug option.
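For example, assuming your job is named remote:

fscrawler remote --debug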

Hello @dadoonet,

I have already created the FSCrawler job and updated the _settings.yaml file.
When I try to run that same job, it says the job does not exist and asks whether I want to create it again.

fscrawler remote --debug
09:17:35,051 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [2/_settings.json] already exists
09:17:35,055 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [2/_settings_folder.json] already exists
09:17:35,055 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [5/_settings.json] already exists
09:17:35,055 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [5/_settings_folder.json] already exists
09:17:35,055 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
09:17:35,055 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json] already exists
09:17:35,055 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings.json] already exists
09:17:35,055 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings_folder.json] already exists
09:17:35,055 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [remote]...
09:17:35,347 WARN  [f.p.e.c.f.c.FsCrawlerCli] job [remote] does not exist
09:17:35,347 INFO  [f.p.e.c.f.c.FsCrawlerCli] Do you want to create it (Y/N)?

Regards,
Priyanka

Which FSCrawler version are you using? I guess it's 2.7?
What do you have in your user home directory, under the .fscrawler dir?
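On Windows you can list it with, for example:

dir %USERPROFILE%\.fscrawler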

Hello @dadoonet,

I am using fscrawler-es7 version 2.7.
In the .fscrawler dir, I have 3 folders:

  1. _default
  2. new_attachment (where I have indexed my local files)
  3. remote (where I am trying to index files remotely)

Regards,
Priyanka

What do you have in the remote dir? And what is the content of the files?

Hello @dadoonet,

In the remote dir, I have the _settings.yaml file. I have already shared its content in my update above.

Regards,
Priyanka

Which exact version did you download? Which file, I mean? What is its date?

Hello @dadoonet,

I have downloaded version es7-2.7.
The file I downloaded is fscrawler-es7-2.7-20190625.065648-37.zip.

Regards,
Priyanka

Ok. It's a very old one.

Could you use this one? https://oss.sonatype.org/content/repositories/snapshots/fr/pilato/elasticsearch/crawler/fscrawler-es7/2.7-SNAPSHOT/fscrawler-es7-2.7-20200122.065827-76.zip

Hello @dadoonet,

I have downloaded the version you mentioned.
In the config file, should the url be the path on the remote server from which we want to index the files?

name: "test"
fs:
  url: "/path/to/data/dir/on/server"
server:
  hostname: "mynode.mydomain.com"
  port: 22
  username: "username"
  password: "password"
  protocol: "ssh"

Regards,
Priyanka

This is what will happen behind the scenes:

ssh mynode.mydomain.com
cd /path/to/data/dir/on/server
ls
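You can test the same thing by hand from the machine where FSCrawler runs; if this fails there, FSCrawler will fail too:

ssh username@mynode.mydomain.com "ls /path/to/data/dir/on/server"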

Hello @dadoonet,

I am getting an error:

05:50:08,288 WARN [f.p.e.c.f.FsParserAbstract] Error while crawling /path/to/data/dir/on/server: /path/to/data/dir/on/server doesn't exists.

Also, I am not able to ssh from cmd on my server:

'ssh' is not recognized as an internal or external command,
operable program or batch file.

I am also not able to telnet to port 22.

Regards,
Priyanka

Most likely there is no SSH service running on the machine you want to index.
So it cannot work.
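You can check whether port 22 is reachable, for example from PowerShell on your Windows machine (hostname taken from your settings above):

Test-NetConnection Myremoteserver.com -Port 22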

Hello @dadoonet,

Yes, I need to open the port as well.

Regards,
Priyanka

Yes.