E.g. I have close to 500GB of data to be indexed using FSCrawler with the default configuration. How can I calculate the system requirements for the following?
How much disk storage is required on the device?
What are the RAM requirements for the system?
What are the processing (CPU) requirements for the system?
Any formula to calculate these would be of great help.
There is no such formula, especially in the context of binary documents, where you can't predict how much of the content is images versus text.
It depends.
It depends.
It depends, for example on whether you are doing OCR or not, and on how fast you want it to be. Note that for now FSCrawler is not multithreaded, so if you want to crawl in parallel I believe you need to start multiple instances in parallel.
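As a rough sketch of what "multiple instances in parallel" could look like, here is a minimal Python launcher. It assumes you have already defined several FSCrawler jobs (the job names `docs_part1`, `docs_part2`, `docs_part3` are hypothetical), each pointing at a different subdirectory of the data via its own settings file, and that the `fscrawler` binary is on your PATH; the exact CLI flags may differ between FSCrawler versions.

```python
import subprocess

# Hypothetical job names -- each job's settings point fs.url at a
# different subdirectory of the data to be indexed.
jobs = ["docs_part1", "docs_part2", "docs_part3"]

# Start one FSCrawler process per job. "--loop 1" is assumed here to
# mean "crawl once and exit"; check your FSCrawler version's options.
procs = [subprocess.Popen(["fscrawler", job, "--loop", "1"]) for job in jobs]

# Wait for every crawl to finish before reporting completion.
for p in procs:
    p.wait()

print("All FSCrawler jobs finished.")
```

This is only one way to split the work; the point is simply that parallelism comes from running several independent FSCrawler processes, not from a setting inside a single instance.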
In terms of Elasticsearch: it is "just" going to index documents that have already been processed by FSCrawler, so the quantity of work is not that big. But since you may be extracting a lot of text per document, that can still use some memory and CPU.
In short, you really need to test this yourself...