Hi,
I am new to Elasticsearch and FSCrawler. Could you please let me know which settings I need to put in the _settings.yaml file (for the FSCrawler job) so that I can index files from multiple servers?
I tried with a single server and it is working, but I have files on multiple servers.
Here is the single-server settings file.
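For reference, a minimal single-server _settings.yaml looks roughly like this; the job name, path, and Elasticsearch URL below are placeholders (not my exact file), and the elasticsearch node syntax can differ slightly between FSCrawler versions:

```yaml
---
name: "docs_server1"                 # placeholder job name
fs:
  url: "/data/pdfs"                  # local directory to crawl (placeholder path)
  update_rate: "15m"                 # how often to rescan the directory
elasticsearch:
  nodes:
    - url: "http://127.0.0.1:9200"   # placeholder; point at your ES node
```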
You should run one FSCrawler instance per server. Would that work?
So 3 configuration files, and you launch 3 FSCrawler instances, one per configuration file.
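Roughly like this (the job names are just examples; each job has its own _settings.yaml in the FSCrawler config directory):

```sh
# One FSCrawler job per server/directory to crawl -- names are examples
bin/fscrawler server1_docs &
bin/fscrawler server2_docs &
bin/fscrawler server3_docs &
```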
Just to be clear,
I have server1 with ES and FSCrawler,
and server2 and server3 with PDF files on them.
So, can I run FSCrawler on server2 and server3, or do I need new servers?
Thank you so much!
Can I ask how much memory is needed to run FSCrawler per server for 1 GB files?
Because yesterday I was running ES and FSCrawler on one server with 2 GB of RAM and 10 MB files on it. When I ran FSCrawler, it displayed "got a hard failure" and ES stopped. This may be because of memory, so I will work on a new machine with 8 GB today. But to confirm the root cause, I will open another ticket for that issue.
So,
could you please tell me how much memory a server needs when running ES and FSCrawler with 1 GB files on the same server?
And again, how much memory is needed to run FSCrawler on a document server (which has 1 GB files)?
In production, IMO you should separate ES from everything else. It should be alone on a machine. For Elasticsearch sizing, it also depends on the size of the extracted text and whether you are keeping the binary document (BASE64) or not (I recommend not storing the BASE64 document in Elasticsearch, specifically with 1 GB files).
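If I remember the setting correctly, the BASE64 source is only stored when fs.store_source is enabled in _settings.yaml, so just leave it at false (the path below is a placeholder):

```yaml
fs:
  url: "/data/pdfs"        # placeholder path
  store_source: false      # don't keep the BASE64-encoded original in Elasticsearch
```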
I have no idea. But dealing with such big files will probably require a lot of memory. At the very least I'd bet on 4 GB, but probably much more.
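If memory becomes the limit, you can raise FSCrawler's JVM heap via the FS_JAVA_OPTS environment variable; the value and job name below are only examples:

```sh
# Example only: give the FSCrawler JVM a 4 GB heap before starting the job
FS_JAVA_OPTS="-Xmx4g" bin/fscrawler server2_docs
```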
Could you tell me what those files are? What do they contain?
Right now, I am using some textbooks (mainly PDFs) of 10 MB, but I want to try with 1 GB of different types of files, to see the performance of Elasticsearch and FSCrawler.
Do you have any other suggestions, other than one FSCrawler instance per server, for indexing files from multiple servers?
Do you mean that you have PDF documents of 1 GB in size? Is that realistic?
No. It's not supported as we speak. And I'd prefer having one FSCrawler instance running per directory to monitor instead of one single instance for many dirs.
It's not only PDF files; we have different types of documents, such as Word, Excel, and zip files, etc., on different servers.
I want to use Elasticsearch to access them.
Thank you so much for your reply!
you can close this ticket!!