FSCrawler - Best approach to load a massive amount of documents

Hello, I am trying to come up with a solution to load a massive amount of documents (xls, xlsx, doc, xml, txt, pdf, etc.), 30+ terabytes in total, into Elastic.

I have spun up the stack with FSCrawler and have been adding documents to the folder to be indexed. It takes a lot of time and space, so I am trying to figure out how to delete the files automatically after they have been indexed, and whether there is a better approach given the number of documents. Also, what would be a good way to monitor the progress of the indexing job? Thank you!

Welcome!

w00t! I have never heard of a project using FSCrawler at this scale.
That's super interesting.

Yeah. FSCrawler still uses a single thread. Parallel crawling was requested a long time ago (Support parallel crawling · Issue #627 · dadoonet/fscrawler · GitHub), but it requires a lot of refactoring.

In the meantime, you could launch multiple FSCrawler instances, one per subfolder of the parent folder. That way you will use more CPU than today. Not ideal, but it is at least a solution.
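If it helps, here is a rough sketch of that workaround in Python. It assumes FSCrawler 2.x conventions (one _settings.yaml per job under ~/.fscrawler/<job>/, started with `fscrawler <job>`); the paths, the Elasticsearch URL and the job names below are placeholders to adapt to your setup:

```python
#!/usr/bin/env python3
"""Rough sketch: one FSCrawler job per subfolder, all running in parallel.

Assumes FSCrawler 2.x conventions (one _settings.yaml per job under
~/.fscrawler/<job>/, started with `fscrawler <job>`). Paths, URLs and
job names are placeholders.
"""
import subprocess
from pathlib import Path

import yaml  # pip install pyyaml

PARENT = Path("/data/to_index")          # parent folder holding the subfolders (placeholder)
CONFIG_DIR = Path.home() / ".fscrawler"  # default FSCrawler config directory
FSCRAWLER_BIN = "fscrawler"              # or the full path to bin/fscrawler


def make_job(subfolder: Path) -> str:
    """Write a minimal _settings.yaml pointing one job at one subfolder."""
    job_name = f"job_{subfolder.name}"
    settings = {
        "name": job_name,
        "fs": {
            "url": str(subfolder),
            "store_source": False,  # do not store the raw binary in the index (also the default)
        },
        "elasticsearch": {
            "nodes": [{"url": "http://127.0.0.1:9200"}],  # adjust URL / credentials to your cluster
        },
    }
    job_dir = CONFIG_DIR / job_name
    job_dir.mkdir(parents=True, exist_ok=True)
    (job_dir / "_settings.yaml").write_text(yaml.safe_dump(settings))
    return job_name


def main() -> None:
    jobs = [make_job(d) for d in sorted(PARENT.iterdir()) if d.is_dir()]
    # One FSCrawler process per job: each process is still single-threaded,
    # but together they crawl the subfolders in parallel.
    procs = [subprocess.Popen([FSCRAWLER_BIN, job]) for job in jobs]
    for p in procs:
        p.wait()


if __name__ == "__main__":
    main()
```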

It takes a lot of [...] space

You mean on the Elasticsearch side? Are you storing the binary source by any chance? (https://fscrawler.readthedocs.io/en/fscrawler-2.7/admin/fs/local-fs.html#storing-binary-source-document)
What do you mean by a lot of space? How much does it represent compared to the original binary documents?
What are your job settings?
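If you want to check the space usage from the Elasticsearch side, something like this (Python with requests; the URL, credentials and index name are assumptions, the index defaults to the job name) prints the on-disk size and document count of the index so you can compare it with the size of the source folder:

```python
import requests  # pip install requests

ES = "http://127.0.0.1:9200"  # adjust URL / auth to your cluster (placeholder)
INDEX = "job_name"            # by default FSCrawler writes to an index named after the job

# Primary store size and document count for the index, via the _cat API.
resp = requests.get(
    f"{ES}/_cat/indices/{INDEX}",
    params={"h": "index,docs.count,pri.store.size", "bytes": "b", "format": "json"},
)
resp.raise_for_status()
for row in resp.json():
    size_gib = int(row["pri.store.size"]) / 1024**3
    print(f"{row['index']}: {row['docs.count']} docs, {size_gib:.1f} GiB on disk (primaries)")
```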

I am trying to figure out how to delete the files automatically

There's no such option. And I don't believe this is something that should be added to the project, as it looks too dangerous to me. Today, the source directory only needs to be readable.

Also, what would be a good way to monitor progress of the index job?

Sadly, this feature is not there yet. There have been some requests for it already:
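In the meantime, a rough workaround is to compare the number of documents in the index with the number of files on disk. A minimal sketch, assuming the index is named after the job; the URL and paths are placeholders, and since FSCrawler can skip some file types, treat the percentage as an approximation:

```python
import time
from pathlib import Path

import requests  # pip install requests

ES = "http://127.0.0.1:9200"     # placeholder
INDEX = "job_name"               # by default the index is named after the FSCrawler job
SOURCE = Path("/data/to_index")  # the folder the job is crawling (placeholder)

# Rough progress: indexed document count vs. number of files on disk.
# Some files may never be indexed, so this is only an approximation.
total_files = sum(1 for p in SOURCE.rglob("*") if p.is_file())
while True:
    count = requests.get(f"{ES}/{INDEX}/_count").json()["count"]
    print(f"{count}/{total_files} files indexed ({100 * count / max(total_files, 1):.1f}%)")
    if count >= total_files:
        break
    time.sleep(60)
```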

David, thank you for a very thorough response. This is a great project and exactly what I was looking for. I am not sure myself how it will work at such a scale, but I will try to make it happen.

What do you mean by a lot of space? How much does it represent compared to the original binary documents?
What are your job settings?

I think I have sorted this out, and I can see Elastic's compression working as well.

I am trying to figure out how to delete the files automatically

I was asking this because all my files are in archives, and I am trying to come up with a code solution to rotate them for FSCrawler.
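Roughly what I have in mind, as a sketch only: the paths, the index name and the file.filename field are assumptions based on FSCrawler's default mapping, and I believe I would also need fs.remove_deleted: false in the job settings so the documents are not removed from the index once I delete the extracted files.

```python
"""Rotation sketch: unpack one archive at a time into the folder FSCrawler watches,
wait until the extracted files show up in the index, then delete them and move on.
Paths, the index name and the file.filename field are assumptions; requires
fs.remove_deleted: false in the job settings so deleted files stay indexed."""
import shutil
import time
from pathlib import Path

import requests  # pip install requests

ES = "http://127.0.0.1:9200"              # placeholder
INDEX = "job_name"                        # default index name = FSCrawler job name
ARCHIVES = Path("/data/archives")         # where the .zip files live (placeholder)
STAGING = Path("/data/to_index/current")  # a subfolder inside the job's fs.url


def indexed_count(filenames: list[str]) -> int:
    """How many of these file names are already searchable in the index (Elasticsearch 7+)."""
    body = {"query": {"terms": {"file.filename": filenames}}, "size": 0}
    r = requests.post(f"{ES}/{INDEX}/_search", json=body)
    r.raise_for_status()
    return r.json()["hits"]["total"]["value"]


for archive in sorted(ARCHIVES.glob("*.zip")):
    STAGING.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(archive), str(STAGING))
    names = [p.name for p in STAGING.rglob("*") if p.is_file()]

    # Wait until most of the extracted files are indexed; some file types may be
    # skipped by FSCrawler, hence the threshold rather than an exact match.
    while names and indexed_count(names) < 0.95 * len(names):
        time.sleep(60)

    shutil.rmtree(STAGING)  # free the disk space before unpacking the next archive
    print(f"done with {archive.name}")
```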

Thank you for posting the issue tickets; I will try to contribute what I can.
