I'm working with a mounted network filesystem on Linux that contains a massive collection of files (text, images, videos, etc.). My goal is to index specific metadata from these files into Elasticsearch.
Specifically, I need the following fields:
Filename
Creation Date
File Path
Extension
I'm not interested in content indexing and only require the listed metadata. Additionally, certain folders within the filesystem will be updated over time.
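For illustration, here is a minimal sketch of indexing just those four fields without any crawler at all, using the standard library plus the official elasticsearch Python client. The index name, mount point, and Elasticsearch URL are placeholders; note also that most Linux filesystems don't expose a true creation date, so the sketch falls back to st_ctime.

```python
import os
from datetime import datetime, timezone
from pathlib import Path

from elasticsearch import Elasticsearch, helpers  # pip install elasticsearch

ES_URL = "http://localhost:9200"   # placeholder
INDEX = "file-metadata"            # placeholder index name
ROOT = "/mnt/network-share"        # placeholder mount point

def file_doc(path: Path) -> dict:
    st = path.stat()
    # Most Linux filesystems have no true "creation date"; st_ctime is the
    # inode change time. Use st_birthtime where the platform provides it.
    created = getattr(st, "st_birthtime", st.st_ctime)
    return {
        "filename": path.name,
        "path": str(path.parent),
        "extension": path.suffix.lstrip(".").lower(),
        "created": datetime.fromtimestamp(created, tz=timezone.utc).isoformat(),
    }

def actions(root: str):
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            p = Path(dirpath) / name
            yield {
                "_index": INDEX,
                "_id": str(p),        # full path as _id makes re-runs idempotent
                "_source": file_doc(p),
            }

if __name__ == "__main__":
    es = Elasticsearch(ES_URL)
    # Streaming bulk keeps memory flat even with millions of files.
    helpers.bulk(es, actions(ROOT), chunk_size=5000)
```

Using the path as the document _id means re-crawling the same tree only overwrites existing documents rather than duplicating them.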
I've explored FSCrawler for this task but found it doesn't scale well with large numbers of files and tends to get stuck. I'm looking for a more robust solution that can:
Monitor specific folders
Index metadata of new files as they're added
Handle a massive volume of files efficiently
Any guidance or recommendations would be greatly appreciated!
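One thing worth knowing for the "index new files as they're added" requirement: inotify-based watchers generally only see changes made locally, so on an NFS/SMB mount written to by other machines a periodic polling pass per watched folder is usually the safer pattern. A rough sketch of that idea (same placeholder index and URLs as above, poll interval chosen arbitrarily):

```python
import time
from datetime import datetime, timezone
from pathlib import Path

from elasticsearch import Elasticsearch, helpers

ES_URL = "http://localhost:9200"            # placeholder
INDEX = "file-metadata"                     # placeholder
WATCHED = ["/mnt/network-share/incoming"]   # placeholder folders to monitor
POLL_SECONDS = 300

def doc_for(p: Path) -> dict:
    st = p.stat()
    created = getattr(st, "st_birthtime", st.st_ctime)  # no true creation time on most Linux FS
    return {
        "filename": p.name,
        "path": str(p.parent),
        "extension": p.suffix.lstrip(".").lower(),
        "created": datetime.fromtimestamp(created, tz=timezone.utc).isoformat(),
    }

def poll_once(es: Elasticsearch, folder: str, seen: set) -> None:
    """Index metadata for files that appeared since the previous pass."""
    actions = []
    for p in Path(folder).rglob("*"):
        key = str(p)
        if key in seen or not p.is_file():
            continue
        seen.add(key)
        actions.append({"_index": INDEX, "_id": key, "_source": doc_for(p)})
    if actions:
        helpers.bulk(es, actions)

if __name__ == "__main__":
    es = Elasticsearch(ES_URL)
    # For very large trees, persist this set (or compare mtimes) instead of
    # keeping every path in memory.
    seen_paths: set = set()
    while True:
        for folder in WATCHED:
            poll_once(es, folder, seen_paths)
        time.sleep(POLL_SECONDS)
```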
FWIW, FSCrawler indeed does not scale well the way it's implemented today.
The workaround is to create more than one FSCrawler instance, say one per directory under the root directory.
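To make that concrete, here is a rough sketch of "one instance per top-level directory": generate a minimal job settings file for each directory and start each job separately. The job names, paths, and settings keys below follow recent FSCrawler documentation but are illustrative; verify them against the version you actually run.

```python
import subprocess
from pathlib import Path

ROOT = Path("/mnt/network-share")          # placeholder mount point
FSCRAWLER_HOME = Path.home() / ".fscrawler"
ES_URL = "http://localhost:9200"           # placeholder

# Minimal per-job settings; fs.index_content: false skips content extraction
# and keeps only the file metadata. Check the key names for your version.
SETTINGS_TEMPLATE = """\
name: "{job}"
fs:
  url: "{path}"
  update_rate: "15m"
  index_content: false
elasticsearch:
  nodes:
    - url: "{es_url}"
"""

def create_jobs():
    procs = []
    for top_dir in sorted(p for p in ROOT.iterdir() if p.is_dir()):
        job = f"files-{top_dir.name}"
        job_dir = FSCRAWLER_HOME / job
        job_dir.mkdir(parents=True, exist_ok=True)
        (job_dir / "_settings.yaml").write_text(
            SETTINGS_TEMPLATE.format(job=job, path=top_dir, es_url=ES_URL)
        )
        # One crawler process per top-level directory (assumes `fscrawler` is on PATH).
        procs.append(subprocess.Popen(["fscrawler", job]))
    return procs

if __name__ == "__main__":
    for proc in create_jobs():
        proc.wait()
```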
I've been experimenting with FSCrawler and found it can index ~18 million files in 17 hours. To optimize this, I'm considering a hybrid approach:
Pre-indexing static files: Use FSCrawler with includes/excludes regex to index known, unchanging content.
Scripted FSCrawler for dynamic folders: Since I know when new folders with data will be added (and that these periods are short, e.g., 5 weeks), I'd like to use a script to generate FSCrawler settings and trigger it automatically before data is added.
Question: Do you think this hybrid approach is a reasonable solution? Are there potential pitfalls or better alternatives I should consider?
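For the "trigger it automatically" part of the dynamic-folder idea, one possible shape is a small script run from cron (or a systemd timer) shortly before and during the known data-loading window: create the job if it doesn't exist yet, then run a single crawl pass. This is only a sketch; the paths and names are placeholders, and the --loop 1 option (crawl once and exit) should be confirmed against your FSCrawler version.

```python
import subprocess
from pathlib import Path

FSCRAWLER_HOME = Path.home() / ".fscrawler"
DYNAMIC_DIR = Path("/mnt/network-share/weekly-drops")  # placeholder parent of dynamic folders
ES_URL = "http://localhost:9200"                       # placeholder

SETTINGS_TEMPLATE = """\
name: "{job}"
fs:
  url: "{path}"
  index_content: false
elasticsearch:
  nodes:
    - url: "{es_url}"
"""

def ensure_job(folder: Path) -> str:
    """Create an FSCrawler job for a dynamic folder if it doesn't exist yet."""
    job = f"dynamic-{folder.name}"
    job_dir = FSCRAWLER_HOME / job
    if not job_dir.exists():
        job_dir.mkdir(parents=True)
        (job_dir / "_settings.yaml").write_text(
            SETTINGS_TEMPLATE.format(job=job, path=folder, es_url=ES_URL)
        )
    return job

if __name__ == "__main__":
    # Intended to be called from cron during the known ~5-week window, e.g.:
    #   0 * * * * /usr/bin/python3 /opt/indexing/crawl_dynamic.py
    for folder in sorted(p for p in DYNAMIC_DIR.iterdir() if p.is_dir()):
        job = ensure_job(folder)
        # --loop 1: run a single crawl and exit, so cron controls the schedule
        # (verify this flag is available in the FSCrawler release you use).
        subprocess.run(["fscrawler", job, "--loop", "1"], check=False)
```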