Hello,
I have configured ES cluster with 3 nodes.
I want to index files using fscrawler.
Is there any setting i need to do to mention all 3 nodes in fscrawler config file?
Thanks,
Priyanka
Hey
Have a look here:
https://fscrawler.readthedocs.io/en/latest/admin/fs/elasticsearch.html#node-settings
There's an example of this.
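For reference, listing several nodes looks like this in the job's _settings.yaml (the hostnames below are placeholders, not values from this thread):

```yaml
elasticsearch:
  nodes:
  - url: "http://node1:9200"
  - url: "http://node2:9200"
  - url: "http://node3:9200"
```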
Hello @dadoonet,
Thanks for your reply!
One more question: I want to index files that are located on another server.
That means my indexing job will run on one server, but the files to index live on another server.
Is there an option for such a setup, or can we put the remote path directly in url?
Regards,
Priyanka
To index remote files you can:
- use the SSH settings (the server section of the job settings), or
- mount the remote directory on the machine where FSCrawler is running and index it as a local path.
The first solution is the best IMO.
Hello @dadoonet,
I am trying to use SSH, but it is not working for me.
My config file:
---
name: "remote"
fs:
  url: "C:\\tmp\\priyanka"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
server:
  hostname: "Myremoteserver.com"
  username: "myusername"
  password: "Mypassword"
  port: 22
  protocol: "ssh"
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
Regards,
Priyanka
Could you share the FSCrawler logs?
Start with the --debug option.
Hello @dadoonet,
I have already created the fscrawler job and updated the _settings.yaml file.
When I try to run that same job, it says the job does not exist and asks whether I want to create it again.
fscrawler remote --debug
09:17:35,051 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [2/_settings.json] already exists
09:17:35,055 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [2/_settings_folder.json]already exists
09:17:35,055 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [5/_settings.json] already exists
09:17:35,055 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [5/_settings_folder.json]already exists
09:17:35,055 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings.json] already exists
09:17:35,055 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [6/_settings_folder.json]already exists
09:17:35,055 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings.json] already exists
09:17:35,055 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] Mapping [7/_settings_folder.json]already exists
09:17:35,055 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [remote]...
09:17:35,347 WARN [f.p.e.c.f.c.FsCrawlerCli] job [remote] does not exist
09:17:35,347 INFO [f.p.e.c.f.c.FsCrawlerCli] Do you want to create it (Y/N)?
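For context, FSCrawler 2.x resolves a job named remote from ~/.fscrawler/remote/_settings.yaml by default (unless --config_dir is set). A small stdlib sketch to check that the file is where the CLI expects it, with the path layout assumed from the docs:

```python
from pathlib import Path

def job_settings_path(job: str) -> Path:
    """Default location where FSCrawler 2.x looks for a job's settings."""
    return Path.home() / ".fscrawler" / job / "_settings.yaml"

settings = job_settings_path("remote")
print(settings, "exists:", settings.exists())
```

If exists prints False, the "job does not exist" prompt is expected: the settings file is not in the directory FSCrawler is scanning.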
Regards,
Priyanka
Which FSCrawler version are you using? I guess it's 2.7?
What do you have in your user home directory, under the .fscrawler dir?
Hello @dadoonet,
I am using FSCrawler version 2.7 (the es7 build).
In the .fscrawler dir, I have 3 folders.
Regards,
Priyanka
What do you have in the remote dir? And what is the content of the files?
Hello @dadoonet,
In the remote dir, I have the _settings.yaml file; I already posted its content in the update above.
Regards,
Priyanka
Which exact version did you download? Which file, I mean? What is its date?
Hello @dadoonet,
I downloaded version 2.7 (es7): the fscrawler-es7-2.7-20190625.065648-37 zip file.
Regards,
Priyanka
OK, that's a very old snapshot.
Could you use this one instead? https://oss.sonatype.org/content/repositories/snapshots/fr/pilato/elasticsearch/crawler/fscrawler-es7/2.7-SNAPSHOT/fscrawler-es7-2.7-20200122.065827-76.zip
Hello @dadoonet,
I have downloaded the mentioned version.
In the config file, should the url be the path on the remote server from which we want to index the files?
name: "test"
fs:
  url: "/path/to/data/dir/on/server"
server:
  hostname: "mynode.mydomain.com"
  port: 22
  username: "username"
  password: "password"
  protocol: "ssh"
Regards,
Priyanka
This is what will happen behind the scenes:
ssh mynode.mydomain.com
cd /path/to/data/dir/on/server
ls
Hello @dadoonet,
I am getting an error:
05:50:08,288 WARN [f.p.e.c.f.FsParserAbstract] Error while crawling /path/to/data/dir/on/server: /path/to/data/dir/on/server doesn't exists.
Also, I am not able to ssh from cmd on my server:
'ssh' is not recognized as an internal or external command,
operable program or batch file.
And I am not able to telnet to port 22 either.
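On Windows boxes where neither ssh nor telnet is installed, a tiny stdlib script can stand in for the telnet test (the hostname here is just the placeholder from the config above):

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # DNS failure, refused, or timed out
        return False

# e.g. port_open("Myremoteserver.com", 22) should be True if sshd is reachable
```

If this returns False for port 22, FSCrawler's SSH protocol cannot work against that host either.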
Regards,
Priyanka
Most likely there is no SSH service running on the machine you want to index, so it cannot work.
Yes.