Hi,
I am new to Elasticsearch and FSCrawler. Could you please let me know which settings I need to put in the _settings.yaml file (for the FSCrawler job) so that I can index files from multiple servers?
I tried with a single server and it is working, but I have files on multiple servers.
Here is the single-server settings file.
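For reference, a minimal single-server _settings.yaml looks roughly like this; the job name, path, and Elasticsearch URL below are placeholders (not my exact file), and the elasticsearch node syntax can differ slightly between FSCrawler versions:

```yaml
---
name: "docs_server1"                 # placeholder job name
fs:
  url: "/data/pdfs"                  # local directory to crawl (placeholder path)
  update_rate: "15m"                 # how often to rescan the directory
elasticsearch:
  nodes:
    - url: "http://127.0.0.1:9200"   # placeholder; point at your ES node
```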
You should run one FSCrawler instance per server. Would that work?
So 3 configuration files, and you launch 3 FSCrawler instances, one per configuration file.
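Roughly like this (the job names are just examples; each job has its own _settings.yaml in the FSCrawler config directory):

```sh
# One FSCrawler job per server/directory to crawl -- names are examples
bin/fscrawler server1_docs &
bin/fscrawler server2_docs &
bin/fscrawler server3_docs &
```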
Just to be clear,
I have server1 with ES and FSCrawler,
and server2 and server3 with PDF files on them.
So, can I run FSCrawler on server2 and server3, or do I need new servers?
Thank you so much!
Can I ask how much memory is needed to run FSCrawler per server for 1 GB files?
Because yesterday I was running ES and FSCrawler on one server with 2 GB of RAM and 10 MB files on it. When I ran FSCrawler, it displayed "got a hard failure" and ES stopped. This may be because of memory, so I will work on a new machine with 8 GB today. But to confirm the root cause, I will open another ticket for that issue.
So,
could you please tell me how much memory a server needs when running ES and FSCrawler with 1 GB files on the same server?
And again, how much memory is needed to run FSCrawler on a document server (which has 1 GB files)?
In production, IMO you should separate ES from everything else. It should be alone on a machine. For Elasticsearch sizing, it also depends on the size of the extracted text and whether you are keeping the binary document (BASE64) or not (I recommend not storing the BASE64 document in Elasticsearch, specifically with 1 GB files).
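If I remember the setting correctly, the BASE64 source is only stored when fs.store_source is enabled in _settings.yaml, so just leave it at false (the path below is a placeholder):

```yaml
fs:
  url: "/data/pdfs"        # placeholder path
  store_source: false      # don't keep the BASE64-encoded original in Elasticsearch
```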
I have no idea. But dealing with such big files will probably require a lot of memory. At the very least I'd bet on 4 GB, but probably much more.
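If memory becomes the limit, you can raise FSCrawler's JVM heap via the FS_JAVA_OPTS environment variable; the value and job name below are only examples:

```sh
# Example only: give the FSCrawler JVM a 4 GB heap before starting the job
FS_JAVA_OPTS="-Xmx4g" bin/fscrawler server2_docs
```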
Could you tell me what those files are? What do they contain?
Right now, I am using some textbooks (mainly PDFs) of 10 MB, but I want to try with 1 GB of different types of files, to see the performance of Elasticsearch and FSCrawler.
Do you have any other suggestions, other than one FSCrawler instance per server, for indexing files from multiple servers?
Do you mean that you have PDF documents of 1 GB in size? Is that realistic?
No. It's not supported as we speak. And I'd prefer having one FSCrawler instance running per directory to monitor instead of one single instance for many dirs.
It's not only PDF files; we have different types of documents, such as Word, Excel, and zip files, etc., on different servers.
I want to use Elasticsearch to access them.
Thank you so much for your reply!
you can close this ticket!!