Hello, I am trying to come up with a solution to load a massive amount of documents (xls, xlsx, doc, xml, txt, pdf, etc.), 30+ terabytes in total, into Elastic.
I have spun up the stack with FSCrawler and have been adding documents to the folder to be indexed. It takes a lot of time and space, so I am trying to figure out how to delete the files automatically after they have been indexed, and whether there is a better approach given this volume of documents. Also, what would be a good way to monitor the progress of the indexing job? Thank you!
In the meantime, you could launch multiple FSCrawler instances, one per subfolder of the parent folder. That way you will use more CPU than you do today. Not ideal, but it's at least a solution.
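If it helps, here is a minimal sketch of that setup, assuming FSCrawler 2.7's default layout where each job reads its settings from ~/.fscrawler/<job_name>/_settings.yaml; the job and folder names below are made up for illustration:

```sh
# One FSCrawler job per subfolder: each job's _settings.yaml points fs.url
# at its own subfolder (e.g. /data/docs/batch_a for the "docs_batch_a" job).
# Starting the jobs separately lets each crawl use its own CPU.
bin/fscrawler docs_batch_a &
bin/fscrawler docs_batch_b &
bin/fscrawler docs_batch_c &
```

Note that by default each job writes to an index named after the job, so if you want everything in a single index you would point the jobs at the same index via the elasticsearch.index setting.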
It takes a lot of [...] space
You mean on the Elasticsearch side? Are you storing the binary documents by any chance? (Local FS settings — FSCrawler 2.7 documentation)
What do you mean by a lot of space? How much does it represent compared to the original binary documents?
What are your job settings?
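For reference, a sketch of the local FS settings that most affect index size (the job name and path are hypothetical):

```yaml
# Hypothetical job settings (~/.fscrawler/docs_job/_settings.yaml), trimmed to
# the options relevant to disk usage on the Elasticsearch side.
name: "docs_job"
fs:
  url: "/data/docs"
  store_source: false     # true would store the whole Base64-encoded binary in the index
  indexed_chars: "100%"   # how much of the extracted text to index per file
```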
I am trying to figure out how to delete the files automatically
There's no such option. And I don't believe this is something that should be added to the project, as it looks too dangerous to me. Today, the source dir only needs to be readable.
Also, what would be a good way to monitor progress of the index job?
Sadly, this feature is not there yet. There have been some requests for it already:
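In the meantime, a rough workaround is to poll Elasticsearch itself while the crawl runs. A sketch, assuming the target index is named docs_job (by default FSCrawler names the index after the job):

```sh
# Document count and on-disk size of the index, refreshed every 30 seconds.
watch -n 30 'curl -s "localhost:9200/_cat/indices/docs_job?v&h=index,docs.count,store.size"'
```

Comparing docs.count against the number of files in the source folder gives a crude progress estimate.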