FSCrawler stuck at 2.6GB index size

I have an Elasticsearch setup that uses FSCrawler to index a folder of about 5TB. The _settings.yml has not been modified, except that the update rate is set to 3 hours.
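For reference, a minimal _settings.yml along those lines would look roughly like this (the job name and path are placeholders, not the real ones; everything else is left at the defaults):

```yaml
name: "my_docs"           # placeholder job name
fs:
  url: "/path/to/data"    # placeholder path to the ~5TB folder
  update_rate: "3h"       # update rate set to 3 hours
```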

Everything went well until the index grew to 2.6GB, then indexing stalled. What could be the problem? Is it something to do with memory?

How many documents have been indexed? How many are missing?

Anything in FSCrawler logs?

Around 800,000 out of about 3,000,000 documents have been indexed so far. My server has 40GB of memory and I have allocated about 12GB to Elasticsearch. I am running FSCrawler in the terminal, but after 2-3 days, when I query the index details, it appears to have stalled, though there is no error in the terminal.
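(By "query the index details" I mean checking the document count and store size in Elasticsearch, roughly like this; the index name is a placeholder and the node is assumed to be on localhost:)

```sh
# Shows docs.count and store.size for the FSCrawler index (placeholder name)
curl "http://localhost:9200/_cat/indices/my_docs?v"

# Or just the document count
curl "http://localhost:9200/my_docs/_count?pretty"
```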

How much heap did you give to FSCrawler?

I am sorry, I don't know how to assign heap to FSCrawler. I just used the default settings. Kindly advise on how I can do that.

Here you go: https://fscrawler.readthedocs.io/en/latest/admin/jvm-settings.html
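In short, from that page: the heap is passed through the FS_JAVA_OPTS environment variable when launching FSCrawler. Something like this, where the job name and heap size are just examples:

```sh
# Give FSCrawler a 1GB min/max heap for this run (values are examples)
FS_JAVA_OPTS="-Xms1g -Xmx1g" bin/fscrawler my_job
```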

Got it! Thank you.
Three quick questions:
1. What is the default FSCrawler heap size?
2. How much heap should be sufficient for this amount of data?
3. Just for clarification: since my update_rate is 3h, does that mean it will start updating even before it completes indexing?

1. I don't know TBH, as I don't think I set it, so it's probably 256m to 1g.

2. As you have a lot of memory, I'd try 4GB.

3. No. The 3h is counted from the end of a run.

Trying to set the heap gives an error --> Invalid maximum heap size

You don't have enough free memory available on your machine apparently.

Do you think I am doing it right? Because it gives the same error even with 512MB.

Maybe you should define a system variable using the Control Panel and restart the console?

Do you mean like this?

Yes, but without the double quotes and without bin/fscrawler.
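In other words, in the system variable dialog it should be just the variable name and the JVM options as the value, for example:

```
Variable name:  FS_JAVA_OPTS
Variable value: -Xms4g -Xmx4g
```

(4GB here only because of the suggestion above; adjust as needed.)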

So I used the system variable approach and started indexing afresh. I believe the HEAP shown below belongs to FSCrawler. As you can see, a lot of the memory seems unused, from the RAM to the heap.

I don't know how long it might take to index my 5TB of data. Have you encountered this amount of data before? How long did it take to index? And how do I know whether indexing is still going on or has hung? Because last time there was no progress for more than two days (after about 350GB of data had been indexed, the index size was 2.6GB).

No. I have never tried with that much data myself. I know that some users are also indexing a lot, though.

Well, that's a good question. I should add a REST endpoint for it so you can check.
Or something in the console showing that work is in progress, without needing the --debug option. Would you like to open an issue?
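In the meantime, a rough workaround is to run FSCrawler with the --debug option to get more verbose output while it crawls, and to keep comparing the Elasticsearch document count between checks:

```sh
# More verbose progress output; the job name is a placeholder
bin/fscrawler my_job --debug
```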

Yeah sure.
