Can FSCrawler index files from different servers?

Hi,
I am new to Elasticsearch and FSCrawler. Could you please let me know which settings I need to put in the _settings.yaml file (for an FSCrawler job) so that I can index files from multiple servers?

I tried with a single server and it is working, but I have files on multiple servers.
Here is the single-server settings file:

name: "books"
fs:
  url: "/var/www/html/file-scanner/ESFiles"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
server:
  hostname: "dev2.com"
  port: 22
  username: "swati"
  password: "password123"
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"

Versions: Elasticsearch 7.5.1, FSCrawler 7.2.7. They are running on one server, and my documents are on 3 different servers.

Thanks,
Swati

Welcome!

You should run one FSCrawler instance per server. Would that work?
So 3 configuration files, and you launch 3 FSCrawler instances, one per configuration file.
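
For example (just a sketch; the directory and hostname below are made up, adapt them to your setup), the job file on server2 could look roughly like this, with FSCrawler crawling the local directory and sending documents to the Elasticsearch running on server1:

name: "books_server2"
fs:
  url: "/path/to/docs/on/server2"
  update_rate: "15m"
elasticsearch:
  nodes:
  - url: "http://server1:9200"

And I don't think you need the server section when the files are local to the machine running FSCrawler.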

Thank you so much for your quick reply, David!
Yes, I can run one FSCrawler per server.
Just to clear up one doubt: can I run it on the same document server?

-- Thanks,
Swati

Just to be clear:
I have server1 with ES and FSCrawler, and server2 and server3 with PDF files on them.
So, can I run FSCrawler on server2 and server3, or do I need new servers?

--Thanks,
Swati

You don't need new servers unless you don't have enough free memory.

Thank you so much!
Can I ask how much memory is needed to run FSCrawler per server for 1 GB files?

Because yesterday I was running ES and FSCrawler on one server with 2 GB of RAM, with 10 MB of files on it. When I ran FSCrawler, it displayed "got a hard failure" and ES stopped. This may be because of memory, so I will work on a new machine with 8 GB today. To confirm the root cause I will open another ticket for that issue.
So,

  1. Could you please tell me how much memory a server needs when ES, FSCrawler, and 1 GB of files are all on the same server?
  2. And how much memory is needed to run FSCrawler on a document server (which has 1 GB of files)?

Thanks,
Swati

In production, IMO you should separate ES from everything else; it should be alone on a machine. For Elasticsearch sizing, it also depends on the size of the extracted text and whether you are keeping the binary document (BASE64) or not (I recommend not storing the BASE64 document in Elasticsearch, especially with 1 GB files).
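
In FSCrawler terms that roughly corresponds to something like this in the job settings (a sketch; the character limit is just an example, tune it for your documents):

fs:
  store_source: false
  indexed_chars: "100000"

store_source: false keeps the BASE64 copy of the binary out of the index (you already have that), and if I remember correctly indexed_chars caps how much extracted text gets indexed per document.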

I have no idea, but dealing with such big files will probably require a lot of memory. At the very least I'd bet on 4 GB, but probably much more.

Could you tell what are those files? What do they contain?

Right now, I am using some textbooks (mainly PDFs) of 10 MB, but I want to try with 1 GB of different types of files to see the performance of Elasticsearch and FSCrawler.
Do you have any other suggestions, other than one FSCrawler instance per server, for indexing files from multiple servers?

Do you mean that you have PDF documents of 1 GB in size? Is that realistic?

No. It's not supported as we speak. And I'd prefer having one FSCrawler instance running per directory to monitor instead of one single instance for many dirs.

Hi David,

It's not only PDF files; we have different types of documents such as Word, Excel, and ZIP files on different servers.
I want to use Elasticsearch to access them.
Thank you so much for your reply!
You can close this ticket!

Thanks,
Swati
