FSCrawler stuck at 2.6GB index size

I have an Elasticsearch setup that uses FSCrawler to index a folder of about 5TB. The _settings.yml has not been modified, except that the update rate is set to 3 hours.
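For reference, a minimal _settings.yml along those lines would look roughly like this (the job name and path are placeholders, not the real ones; everything else is left at the defaults):

```yaml
name: "my_docs"           # placeholder job name
fs:
  url: "/path/to/data"    # placeholder path to the ~5TB folder
  update_rate: "3h"       # update rate set to 3 hours
```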

Everything went well until the index grew to 2.6GB, then indexing stalled. What could be the problem? Is it something to do with memory?

How many documents have been indexed? How many are missing?

Anything in FSCrawler logs?

Around 800,000 out of about 3,000,000 documents have been indexed so far. My server has 40GB of memory and I have allocated about 12GB to Elasticsearch. I am running FSCrawler in the terminal, but after 2-3 days, when I query the index details, it appears to have stalled, though there is no error in the terminal.
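(By "query the index details" I mean checking the document count and store size in Elasticsearch, roughly like this; the index name is a placeholder and the node is assumed to be on localhost:)

```sh
# Shows docs.count and store.size for the FSCrawler index (placeholder name)
curl "http://localhost:9200/_cat/indices/my_docs?v"

# Or just the document count
curl "http://localhost:9200/my_docs/_count?pretty"
```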

How much heap did you give to FSCrawler?

I am sorry, I don't know how to assign heap to FSCrawler. I just used the default settings. Kindly advise on how I can do that.

Here you go: https://fscrawler.readthedocs.io/en/latest/admin/jvm-settings.html
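In short, from that page: the heap is passed through the FS_JAVA_OPTS environment variable when launching FSCrawler. Something like this, where the job name and heap size are just examples:

```sh
# Give FSCrawler a 1GB min/max heap for this run (values are examples)
FS_JAVA_OPTS="-Xms1g -Xmx1g" bin/fscrawler my_job
```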

Got it! Thank you.
Three quick questions:
1. What is the default FSCrawler heap size?
2. How much heap should be sufficient for this amount of data?
3. Just for clarification: since my update_rate is 3h, does that mean it will start updating even before it completes indexing?

1. I don't know TBH, as I don't think I set it, so it's probably 256m to 1g.

2. As you have a lot of memory, I'd try 4GB.

3. No. The 3h is counted from the end of a run.

Trying to set the heap gives an error --> Invalid maximum heap size

You don't have enough free memory available on your machine apparently.

Do you think I am doing it right? Because it gives the same error even with 512MB.

Maybe you should define a system variable using the Control Panel and restart the console?

Do you mean like this?

Yes, but without the double quotes and without bin/fscrawler.
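In other words, in the system variable dialog it should be just the variable name and the JVM options as the value, for example:

```
Variable name:  FS_JAVA_OPTS
Variable value: -Xms4g -Xmx4g
```

(4GB here only because of the suggestion above; adjust as needed.)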

So I used the system variable approach and started indexing afresh. I believe the HEAP shown below belongs to FSCrawler. As you can see, a lot of the memory seems unused, from the RAM to the heap.

I don't know how long it might take to index my 5TB of data. Have you encountered this amount of data before? How long did it take to index? And how do I know whether indexing is still going on or has hung? Because last time there was no progress for more than two days (after about 350GB of data had been indexed, the index size was 2.6GB).

No. I have never tried with that much data myself. I know that some users are also indexing a lot, though.

Well, that's a good question. I should add a REST endpoint for it so you can check.
Or something in the console showing that work is in progress, without needing the --debug option. Would you like to open an issue?
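In the meantime, a rough workaround is to run FSCrawler with the --debug option to get more verbose output while it crawls, and to keep comparing the Elasticsearch document count between checks:

```sh
# More verbose progress output; the job name is a placeholder
bin/fscrawler my_job --debug
```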

Yeah sure.
