Hello, I am trying to come up with a solution to load a massive amount of documents (xls, xlsx, doc, xml, txt, pdf, etc.), 30+ terabytes in total, into Elasticsearch.
I have spun up the stack with FSCrawler and have been adding documents to the folder to be indexed. It takes a lot of time and space, so I am trying to figure out how to delete the files automatically after they have been indexed, and whether there is a better approach given the volume of documents. Also, what would be a good way to monitor the progress of the indexing job? Thank you!
In the meantime, you could launch multiple FSCrawler instances, one per subfolder of the parent folder. That way, you will use more CPU than today. Not ideal, but it's at least a solution.
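For example, something along these lines could work as a starting point (a rough, untested sketch: it assumes the fscrawler script is on your PATH, it pre-creates one job settings file per subfolder, and the exact _settings.yaml format depends on your FSCrawler version):

```python
import subprocess
from pathlib import Path

# Placeholder paths - adjust to your layout.
PARENT_DIR = Path("/data/documents")        # parent folder holding the subfolders
FSCRAWLER_HOME = Path.home() / ".fscrawler" # default FSCrawler config dir
ES_URL = "http://127.0.0.1:9200"            # your Elasticsearch node

processes = []
for subfolder in sorted(p for p in PARENT_DIR.iterdir() if p.is_dir()):
    job_name = f"docs_{subfolder.name}"

    # Write a minimal _settings.yaml for this job.
    # NOTE: the exact settings format depends on your FSCrawler version,
    # check the documentation for the version you are running.
    job_dir = FSCRAWLER_HOME / job_name
    job_dir.mkdir(parents=True, exist_ok=True)
    (job_dir / "_settings.yaml").write_text(
        f'name: "{job_name}"\n'
        f"fs:\n"
        f'  url: "{subfolder}"\n'
        f"elasticsearch:\n"
        f"  nodes:\n"
        f'    - url: "{ES_URL}"\n'
    )

    # Launch one FSCrawler process per subfolder; --loop 1 makes it run a
    # single crawl pass and exit instead of watching the folder forever.
    processes.append(subprocess.Popen(["fscrawler", job_name, "--loop", "1"]))

# Wait for all crawlers to finish.
for p in processes:
    p.wait()
```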
I am trying to figure out how to delete the files automatically
There's no such option. And I don't believe this is something that should be added to the project, as it looks too dangerous to me. Today, the source dir only needs to be readable.
Also, what would be a good way to monitor the progress of the indexing job?
Sadly, this feature is not there yet. There have already been some requests for this:
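In the meantime, one rough workaround is to compare the number of files under the crawled folder with the number of documents already in the index. A minimal sketch, assuming default settings where the index is named after the job (the path and index name below are placeholders):

```python
import json
import urllib.request
from pathlib import Path

SOURCE_DIR = Path("/data/documents")   # folder FSCrawler is crawling (placeholder)
ES_URL = "http://127.0.0.1:9200"
INDEX = "my_fscrawler_job"             # by default the index is named after the job

# Count files on disk.
total_files = sum(1 for p in SOURCE_DIR.rglob("*") if p.is_file())

# Count documents already indexed, using the _count API.
with urllib.request.urlopen(f"{ES_URL}/{INDEX}/_count") as resp:
    indexed = json.load(resp)["count"]

print(f"{indexed}/{total_files} files indexed "
      f"({indexed / max(total_files, 1):.1%})")
```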
David, thank you for a very thorough response. This is a great project and exactly what I was looking for. I am not sure myself how it will work at such a scale, but I will try to make it happen.
What do you mean by a lot of space? How much does it represent compared to the original binary documents?
What are your job settings?
I think I have sorted this out, and I can see Elasticsearch's compression working as well.
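In case it helps anyone else, one quick way to check is to compare the on-disk size of the index (from the _cat/indices API) with the size of the original documents. A small sketch, where the index name is a placeholder:

```python
import json
import urllib.request

ES_URL = "http://127.0.0.1:9200"
INDEX = "my_fscrawler_job"   # placeholder index name

# _cat/indices reports the on-disk store size of the index; compare it with
# the total size of the source documents to see how much you are saving.
url = f"{ES_URL}/_cat/indices/{INDEX}?format=json&bytes=b"
with urllib.request.urlopen(url) as resp:
    info = json.load(resp)[0]

print(f"{info['index']}: {int(info['store.size']) / 1024**3:.2f} GiB on disk, "
      f"{info['docs.count']} docs")
```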
I am trying to figure out how to delete the files automatically
I was asking because all my files are in archives, and I am trying to come up with a code solution to rotate them for FSCrawler, roughly along the lines of the sketch below.
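This is a very rough sketch of what I have in mind; the paths and job name are placeholders, and if I read the docs correctly, fs.remove_deleted has to be set to false in the job settings so that already-indexed documents are not removed when the files are rotated out:

```python
import shutil
import subprocess
from pathlib import Path

ARCHIVE_DIR = Path("/data/archives")   # where the archives live (placeholder)
STAGING_DIR = Path("/data/staging")    # the folder FSCrawler is configured to crawl
JOB_NAME = "my_fscrawler_job"          # placeholder job name

for archive in sorted(ARCHIVE_DIR.glob("*.zip")):
    # 1. Extract one archive into the staging folder FSCrawler watches.
    STAGING_DIR.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(archive, STAGING_DIR)

    # 2. Run a single FSCrawler pass over the staging folder and wait for it
    #    to finish (--loop 1 crawls once and exits).
    subprocess.run(["fscrawler", JOB_NAME, "--loop", "1"], check=True)

    # 3. Clean the staging folder before rotating in the next archive.
    shutil.rmtree(STAGING_DIR)
```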
Thank you for posting the issue tickets; I will try to contribute what I can.