So I am trying to index about 2.5 TB of data, roughly 3 million files, using FSCrawler. I have 40 GB of RAM, of which I have set aside a 20 GB heap for FSCrawler for maximum throughput.
```
C:\Elastic\fscrawler-MAR15\bin>fscrawler trial2
14:58:51,919 INFO [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [18.8gb/19.1gb=98.43%], RAM [37gb/40.9gb=90.53%], Swap [1.8gb/47.1gb=3.97%].
14:58:52,998 INFO [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.1.1
14:58:53,185 INFO [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
14:58:53,185 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
```
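For reference, this is roughly how the 20 GB heap was set before launching the job. It's a minimal sketch assuming FSCrawler reads JVM options from the FS_JAVA_OPTS environment variable; adjust if your version uses a different mechanism:

```
REM Sketch: give FSCrawler a fixed 20 GB heap, then start the job
C:\Elastic\fscrawler-MAR15\bin>set FS_JAVA_OPTS=-Xms20g -Xmx20g
C:\Elastic\fscrawler-MAR15\bin>fscrawler trial2
```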
However, only about 1 million documents have been indexed so far, and the index size has stayed at 4 GB for the last 3 weeks. I don't know whether indexing is still going on or has stalled. (Kindly also explain to me what swap memory is with regard to FSCrawler; does mine, which is 1.8 GB, affect performance?)
N.B. I once had to restart indexing because I found the error "Your computer is low on memory... save files and close programs. Java(TM) Platform binary".
Kindly advise me on the indexing situation and memory.
Thank you.
Please don't post images of text as they are hard to read, may not display correctly for everyone, and are not searchable.
Instead, paste the text and format it with the </> icon or pairs of triple backticks (```), and check the preview window to make sure it's properly formatted before posting. This makes it more likely that your question will receive a useful answer.
It would be great if you could update your post to fix this.
Kindly also explain to me what swap memory is
It's when you don't have enough RAM: the OS can use the hard disk as memory instead. But this makes everything super slow.
The important number to look at is the number of documents that need to be indexed.
You can't really compare the source size with the size in Elasticsearch, as only the extracted content is indexed.
So I can see that almost 1m documents have been indexed by FSCrawler. Do you know how many documents should have been indexed?
FSCrawler has visited roughly 100k folders.
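If it helps, both counts can be checked directly in Elasticsearch while the crawl runs. This is a sketch that assumes FSCrawler's default index naming (an index named after the job, plus a <job>_folder index for folders) and a node reachable on localhost:9200; adjust the names and host to your setup:

```
REM Sketch: count indexed documents and folders for the job "trial2" (default index names assumed)
curl -X GET "http://localhost:9200/trial2/_count?pretty"
curl -X GET "http://localhost:9200/trial2_folder/_count?pretty"
```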
Sadly, there is no "progress report" available yet in FSCrawler.
The only way to know is by starting FSCrawler with the --debug option.
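That is, reusing the command from the start of this thread with only the debug flag added:

```
C:\Elastic\fscrawler-MAR15\bin>fscrawler trial2 --debug
```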
Immediately after posting this question, I got another memory-full error. What could be causing this? From Task Manager, memory usage is at about 30%. Are my heap and RAM sufficient?
That's strange that Windows complains about the memory usage.
I'd expect an OutOfMemory java exception instead.
The memory is supposed to be allocated and available for the process.
Is there any option on Windows machines to make sure that a process can actually lock the memory?
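I'm not aware of a true memory-lock switch for a plain JVM process on Windows, but a related trick is to have the JVM commit the whole heap when it starts. A sketch, assuming the same FS_JAVA_OPTS variable as above; -Xms/-Xmx and -XX:+AlwaysPreTouch are standard HotSpot flags, and AlwaysPreTouch touches every heap page during startup so the memory is committed up front:

```
REM Sketch: commit the full 20 GB heap at startup instead of growing it lazily
C:\Elastic\fscrawler-MAR15\bin>set FS_JAVA_OPTS=-Xms20g -Xmx20g -XX:+AlwaysPreTouch
C:\Elastic\fscrawler-MAR15\bin>fscrawler trial2
```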
So I think I found the error: my C drive (the location of my index) was full. After a reboot I found 0 bytes available, so I added more disk space and tried again... I am monitoring it and it seems to be running fine.