So I am trying to index about 2.5 TB of data, roughly 3 million files, using FSCrawler. I have 40 GB of RAM, of which I have set aside a 20 GB heap for FSCrawler for maximum throughput.
```
C:\Elastic\fscrawler-MAR15\bin>fscrawler trial2
14:58:51,919 INFO [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [18.8gb/19.1gb=98.43%], RAM [37gb/40.9gb=90.53%], Swap [1.8gb/47.1gb=3.97%].
14:58:52,998 INFO [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.1.1
14:58:53,185 INFO [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
14:58:53,185 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
```
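For reference, this is roughly how the 20 GB heap was set before launching the job. It's a minimal sketch assuming FSCrawler reads JVM options from the FS_JAVA_OPTS environment variable; adjust if your version uses a different mechanism:

```
REM Sketch: give FSCrawler a fixed 20 GB heap, then start the job
C:\Elastic\fscrawler-MAR15\bin>set FS_JAVA_OPTS=-Xms20g -Xmx20g
C:\Elastic\fscrawler-MAR15\bin>fscrawler trial2
```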
However, only about 1 million documents have been indexed so far, and the index size has stayed at 4 GB for the last 3 weeks. I don't know whether indexing is still going on or has stalled. (Kindly also explain to me what swap memory is with regard to FSCrawler; does mine, which is 1.8 GB, affect performance?)
N.B. I once had to restart indexing because I found the error "Your computer is low on memory... save files and close programs. Java(TM) Platform binary".
Kindly advise me on the indexing situation and memory.
Thank you.
Please don't post images of text as they are hard to read, may not display correctly for everyone, and are not searchable.
Instead, paste the text and format it with the </> icon or pairs of triple backticks (```), and check the preview window to make sure it's properly formatted before posting. This makes it more likely that your question will receive a useful answer.
It would be great if you could update your post to fix this.
Kindly also explain to me what swap memory is
It's when you don't have enough RAM: the OS can use the hard disk as memory instead. But this makes everything super slow.
The important number to look at is the number of documents that need to be indexed.
You can't really compare the source size with the size in Elasticsearch, as only the extracted content is indexed.
So I can see that almost 1m documents have been indexed by FSCrawler. Do you know how many documents should have been indexed?
FSCrawler has visited roughly 100k folders.
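If it helps, both counts can be checked directly in Elasticsearch while the crawl runs. This is a sketch that assumes FSCrawler's default index naming (an index named after the job, plus a <job>_folder index for folders) and a node reachable on localhost:9200; adjust the names and host to your setup:

```
REM Sketch: count indexed documents and folders for the job "trial2" (default index names assumed)
curl -X GET "http://localhost:9200/trial2/_count?pretty"
curl -X GET "http://localhost:9200/trial2_folder/_count?pretty"
```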
Sadly, there is no "progress report" available yet in FSCrawler.
The only way to know is by starting FSCrawler with the --debug option.
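That is, reusing the command from the start of this thread with only the debug flag added:

```
C:\Elastic\fscrawler-MAR15\bin>fscrawler trial2 --debug
```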
Immediately after posting this question, I got another memory-full error. What could be causing this? From Task Manager, memory usage is at about 30%. Are my heap and RAM sufficient?
That's strange that Windows complains about the memory usage.
I'd expect an OutOfMemory java exception instead.
The memory is supposed to be allocated and available for the process.
Is there any option on Windows machines to make sure that a process can actually lock the memory?
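I'm not aware of a true memory-lock switch for a plain JVM process on Windows, but a related trick is to have the JVM commit the whole heap when it starts. A sketch, assuming the same FS_JAVA_OPTS variable as above; -Xms/-Xmx and -XX:+AlwaysPreTouch are standard HotSpot flags, and AlwaysPreTouch touches every heap page during startup so the memory is committed up front:

```
REM Sketch: commit the full 20 GB heap at startup instead of growing it lazily
C:\Elastic\fscrawler-MAR15\bin>set FS_JAVA_OPTS=-Xms20g -Xmx20g -XX:+AlwaysPreTouch
C:\Elastic\fscrawler-MAR15\bin>fscrawler trial2
```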
So I think I found the error: my C drive (the location of my index) was full. After a reboot I found 0 bytes available, so I added more disk space and tried again... I am monitoring it and it seems to be running fine.