Efficient Metadata Indexing for Large Filesystem in Elasticsearch

Hello everyone,

I'm working with a mounted network filesystem on Linux that contains a massive collection of files (text, images, videos, etc.). My goal is to index specific metadata from these files into Elasticsearch.

Specifically, I need the following fields:

  • Filename
  • Creation Date
  • File Path
  • Extension

I'm not interested in content indexing and only require the listed metadata. Additionally, certain folders within the filesystem will be updated over time.
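To make the target document shape concrete, here is a rough sketch of how just these four fields could be collected and bulk-indexed with the official Python Elasticsearch client. The index name `fs_metadata`, the mount point, and the localhost URL are placeholders, and on Linux/NFS the creation date usually has to be approximated with `st_ctime`, since a true birth time isn't exposed on most filesystems:

```python
import os
from datetime import datetime, timezone

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

ES_URL = "http://localhost:9200"   # placeholder Elasticsearch URL
INDEX = "fs_metadata"              # placeholder index name
ROOT = "/mnt/network-share"        # placeholder mount point

def walk_files(root):
    """Yield one metadata document per file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                stat = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            # Most Linux/NFS mounts don't expose a true creation time,
            # so fall back to st_ctime (inode change time) as an approximation.
            created = datetime.fromtimestamp(stat.st_ctime, tz=timezone.utc)
            yield {
                "_index": INDEX,
                "_id": path,  # using the path as the id makes re-runs idempotent
                "_source": {
                    "filename": name,
                    "creation_date": created.isoformat(),
                    "path": path,
                    "extension": os.path.splitext(name)[1].lstrip(".").lower(),
                },
            }

if __name__ == "__main__":
    es = Elasticsearch(ES_URL)
    ok, errors = bulk(es, walk_files(ROOT), raise_on_error=False)
    print(f"indexed {ok} docs, {len(errors)} errors")
```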

I've explored FSCrawler for this task but found it doesn't scale well with large numbers of files and tends to get stuck. I'm looking for a more robust solution that can:

  • Monitor specific folders
  • Index metadata of new files as they're added
  • Handle a massive volume of files efficiently

Any guidance or recommendations would be greatly appreciated!

Welcome!

FWIW, FSCrawler indeed does not scale well the way it's implemented today.
The workaround is to run more than one FSCrawler instance, say one per directory under the root directory.
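If it helps, here is a rough sketch of scripting that fan-out: it writes one FSCrawler job settings file per top-level directory under the root. The paths, the Elasticsearch URL, and the update rate are placeholders, and the settings keys shown (fs.url, fs.update_rate, fs.index_content) should be double-checked against the docs for your FSCrawler version. Each generated job is then started with its own `bin/fscrawler <job_name>` process.

```python
from pathlib import Path

ROOT = Path("/mnt/network-share")             # placeholder: the mounted filesystem
FSCRAWLER_HOME = Path.home() / ".fscrawler"   # default FSCrawler jobs directory

# Minimal per-job settings: metadata only, no content extraction.
SETTINGS_TEMPLATE = """\
name: "{job}"
fs:
  url: "{url}"
  update_rate: "15m"
  index_content: false
elasticsearch:
  nodes:
    - url: "http://localhost:9200"
"""

def create_jobs(root: Path) -> None:
    """Write one FSCrawler job settings file per top-level directory under root."""
    for entry in sorted(root.iterdir()):
        if not entry.is_dir():
            continue
        job = f"fs_{entry.name}"
        job_dir = FSCRAWLER_HOME / job
        job_dir.mkdir(parents=True, exist_ok=True)
        (job_dir / "_settings.yaml").write_text(
            SETTINGS_TEMPLATE.format(job=job, url=entry)
        )
        print(f"created job {job} -> {entry}")

if __name__ == "__main__":
    create_jobs(ROOT)
```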

I've been experimenting with FSCrawler and found it can index ~18 million files in 17 hours. To optimize this, I'm considering a hybrid approach:

  1. Pre-indexing static files: Use FSCrawler with includes/excludes regex to index known, unchanging content.
  2. Scripted FSCrawler for dynamic folders: Since I know when new folders with data will be added (and that these periods are short, e.g., 5 weeks), I'd like to use a script to generate the FSCrawler settings and trigger it automatically before the data is added (rough sketch after this list).
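To make step 2 concrete, here is a rough sketch under a few assumptions: the new folders appear under a single parent directory, FSCrawler is installed at /opt/fscrawler, and its --loop 1 option is used for a one-shot scan (all of these, plus the settings keys, are placeholders to adapt to the actual setup). The crawl call also blocks the polling loop, so a real version would run jobs in the background.

```python
import subprocess
import time
from pathlib import Path

FSCRAWLER_BIN = "/opt/fscrawler/bin/fscrawler"   # placeholder install path
FSCRAWLER_HOME = Path.home() / ".fscrawler"      # default FSCrawler jobs directory
DYNAMIC_ROOT = Path("/mnt/network-share/incoming")  # placeholder parent of new folders

SETTINGS_TEMPLATE = """\
name: "{job}"
fs:
  url: "{url}"
  update_rate: "30m"
  index_content: false
elasticsearch:
  nodes:
    - url: "http://localhost:9200"
"""

def launch_job(folder: Path) -> None:
    """Write a job settings file for the new folder and start a one-shot crawl."""
    job = f"dyn_{folder.name}"
    job_dir = FSCRAWLER_HOME / job
    job_dir.mkdir(parents=True, exist_ok=True)
    (job_dir / "_settings.yaml").write_text(
        SETTINGS_TEMPLATE.format(job=job, url=folder)
    )
    # --loop 1 requests a single scan; re-run it (e.g. from cron) while the
    # folder keeps receiving files during its ~5-week active window.
    subprocess.run([FSCRAWLER_BIN, job, "--loop", "1"], check=True)

if __name__ == "__main__":
    seen = {p.name for p in DYNAMIC_ROOT.iterdir() if p.is_dir()}
    while True:
        for folder in DYNAMIC_ROOT.iterdir():
            if folder.is_dir() and folder.name not in seen:
                seen.add(folder.name)
                launch_job(folder)
        time.sleep(300)  # poll every 5 minutes for newly created folders
```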

Question: Do you think this hybrid approach is a reasonable solution? Are there potential pitfalls or better alternatives I should consider?