FSCrawler - Best approach to load a massive amount of documents

Hello, I am trying to come up with a solution to load a massive amount of documents (xls, xlsx, doc, xml, txt, pdf, etc.), 30+ terabytes in total, into Elastic.

I have spun up the stack with FSCrawler and have been adding documents to the folder to be indexed. It takes a lot of time and space, so I am trying to figure out how to delete the files automatically after they have been indexed, and whether there is a better approach given the number of documents. Also, what would be a good way to monitor the progress of the indexing job? Thank you!

Welcome!

w00t! I have never heard of a project using FSCrawler at this scale.
That's super interesting.

Yeah. FSCrawler still uses a single thread. Parallel crawling was requested a long time ago (Support parallel crawling · Issue #627 · dadoonet/fscrawler · GitHub), but it requires a lot of refactoring.

In the meantime, you could launch multiple FSCrawler instances, one per subfolder of the parent folder. That way you will use more CPU than today. Not ideal, but it is at least a solution.
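If it helps, here is a rough sketch of that workaround in Python. It assumes FSCrawler 2.x conventions (one _settings.yaml per job under ~/.fscrawler/<job>/, started with `fscrawler <job>`); the paths, the Elasticsearch URL and the job names below are placeholders to adapt to your setup:

```python
#!/usr/bin/env python3
"""Rough sketch: one FSCrawler job per subfolder, all running in parallel.

Assumes FSCrawler 2.x conventions (one _settings.yaml per job under
~/.fscrawler/<job>/, started with `fscrawler <job>`). Paths, URLs and
job names are placeholders.
"""
import subprocess
from pathlib import Path

import yaml  # pip install pyyaml

PARENT = Path("/data/to_index")          # parent folder holding the subfolders (placeholder)
CONFIG_DIR = Path.home() / ".fscrawler"  # default FSCrawler config directory
FSCRAWLER_BIN = "fscrawler"              # or the full path to bin/fscrawler


def make_job(subfolder: Path) -> str:
    """Write a minimal _settings.yaml pointing one job at one subfolder."""
    job_name = f"job_{subfolder.name}"
    settings = {
        "name": job_name,
        "fs": {
            "url": str(subfolder),
            "store_source": False,  # do not store the raw binary in the index (also the default)
        },
        "elasticsearch": {
            "nodes": [{"url": "http://127.0.0.1:9200"}],  # adjust URL / credentials to your cluster
        },
    }
    job_dir = CONFIG_DIR / job_name
    job_dir.mkdir(parents=True, exist_ok=True)
    (job_dir / "_settings.yaml").write_text(yaml.safe_dump(settings))
    return job_name


def main() -> None:
    jobs = [make_job(d) for d in sorted(PARENT.iterdir()) if d.is_dir()]
    # One FSCrawler process per job: each process is still single-threaded,
    # but together they crawl the subfolders in parallel.
    procs = [subprocess.Popen([FSCRAWLER_BIN, job]) for job in jobs]
    for p in procs:
        p.wait()


if __name__ == "__main__":
    main()
```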

It takes a lot of [...] space

You mean on the Elasticsearch side? Are you storing the binary source by any chance? (https://fscrawler.readthedocs.io/en/fscrawler-2.7/admin/fs/local-fs.html#storing-binary-source-document)
What do you mean by a lot of space? How much does it represent compared to the original binary documents?
What are your job settings?
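If you want to check the space usage from the Elasticsearch side, something like this (Python with requests; the URL, credentials and index name are assumptions, the index defaults to the job name) prints the on-disk size and document count of the index so you can compare it with the size of the source folder:

```python
import requests  # pip install requests

ES = "http://127.0.0.1:9200"  # adjust URL / auth to your cluster (placeholder)
INDEX = "job_name"            # by default FSCrawler writes to an index named after the job

# Primary store size and document count for the index, via the _cat API.
resp = requests.get(
    f"{ES}/_cat/indices/{INDEX}",
    params={"h": "index,docs.count,pri.store.size", "bytes": "b", "format": "json"},
)
resp.raise_for_status()
for row in resp.json():
    size_gib = int(row["pri.store.size"]) / 1024**3
    print(f"{row['index']}: {row['docs.count']} docs, {size_gib:.1f} GiB on disk (primaries)")
```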

I am trying to figure out how to delete the files automatically

There's no such option. And I don't believe this is something that should be added to the project, as it looks too dangerous to me. Today, the source directory only needs to be readable.

Also, what would be a good way to monitor progress of the index job?

Sadly, this feature is not there yet. There have been some requests for it already:
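In the meantime, a rough workaround is to compare the number of documents in the index with the number of files on disk. A minimal sketch, assuming the index is named after the job; the URL and paths are placeholders, and since FSCrawler can skip some file types, treat the percentage as an approximation:

```python
import time
from pathlib import Path

import requests  # pip install requests

ES = "http://127.0.0.1:9200"     # placeholder
INDEX = "job_name"               # by default the index is named after the FSCrawler job
SOURCE = Path("/data/to_index")  # the folder the job is crawling (placeholder)

# Rough progress: indexed document count vs. number of files on disk.
# Some files may never be indexed, so this is only an approximation.
total_files = sum(1 for p in SOURCE.rglob("*") if p.is_file())
while True:
    count = requests.get(f"{ES}/{INDEX}/_count").json()["count"]
    print(f"{count}/{total_files} files indexed ({100 * count / max(total_files, 1):.1f}%)")
    if count >= total_files:
        break
    time.sleep(60)
```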

David, thank you for a very thorough response. This is a great project and exactly what I was looking for. I am not sure myself how it will work at such a scale, but I will try to make it happen.

What do you mean by a lot of space? How much does it represent compared to the original binary documents?
What are your job settings?

I think I have sorted this out, and I can see Elastic's compression working as well.

I am trying to figure out how to delete the files automatically

I was asking this because all my files are in archives, and I am trying to come up with a code solution to rotate them for FSCrawler.
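Roughly what I have in mind, as a sketch only: the paths, the index name and the file.filename field are assumptions based on FSCrawler's default mapping, and I believe I would also need fs.remove_deleted: false in the job settings so the documents are not removed from the index once I delete the extracted files.

```python
"""Rotation sketch: unpack one archive at a time into the folder FSCrawler watches,
wait until the extracted files show up in the index, then delete them and move on.
Paths, the index name and the file.filename field are assumptions; requires
fs.remove_deleted: false in the job settings so deleted files stay indexed."""
import shutil
import time
from pathlib import Path

import requests  # pip install requests

ES = "http://127.0.0.1:9200"              # placeholder
INDEX = "job_name"                        # default index name = FSCrawler job name
ARCHIVES = Path("/data/archives")         # where the .zip files live (placeholder)
STAGING = Path("/data/to_index/current")  # a subfolder inside the job's fs.url


def indexed_count(filenames: list[str]) -> int:
    """How many of these file names are already searchable in the index (Elasticsearch 7+)."""
    body = {"query": {"terms": {"file.filename": filenames}}, "size": 0}
    r = requests.post(f"{ES}/{INDEX}/_search", json=body)
    r.raise_for_status()
    return r.json()["hits"]["total"]["value"]


for archive in sorted(ARCHIVES.glob("*.zip")):
    STAGING.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(archive), str(STAGING))
    names = [p.name for p in STAGING.rglob("*") if p.is_file()]

    # Wait until most of the extracted files are indexed; some file types may be
    # skipped by FSCrawler, hence the threshold rather than an exact match.
    while names and indexed_count(names) < 0.95 * len(names):
        time.sleep(60)

    shutil.rmtree(STAGING)  # free the disk space before unpacking the next archive
    print(f"done with {archive.name}")
```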

Thank you for posting the issue tickets; I will try to contribute what I can.
