I'm working with a mounted network filesystem on Linux that contains a massive collection of files (text, images, videos, etc.). My goal is to index specific metadata from these files into Elasticsearch.
Specifically, I need the following fields:
Filename
Creation Date
File Path
Extension
I'm not interested in content indexing and only require the listed metadata. Additionally, certain folders within the filesystem will be updated over time.
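For illustration, here is a minimal sketch of indexing just those four fields without any crawler at all, using the standard library plus the official elasticsearch Python client. The index name, mount point, and Elasticsearch URL are placeholders; note also that most Linux filesystems don't expose a true creation date, so the sketch falls back to st_ctime.

```python
import os
from datetime import datetime, timezone
from pathlib import Path

from elasticsearch import Elasticsearch, helpers  # pip install elasticsearch

ES_URL = "http://localhost:9200"   # placeholder
INDEX = "file-metadata"            # placeholder index name
ROOT = "/mnt/network-share"        # placeholder mount point

def file_doc(path: Path) -> dict:
    st = path.stat()
    # Most Linux filesystems have no true "creation date"; st_ctime is the
    # inode change time. Use st_birthtime where the platform provides it.
    created = getattr(st, "st_birthtime", st.st_ctime)
    return {
        "filename": path.name,
        "path": str(path.parent),
        "extension": path.suffix.lstrip(".").lower(),
        "created": datetime.fromtimestamp(created, tz=timezone.utc).isoformat(),
    }

def actions(root: str):
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            p = Path(dirpath) / name
            yield {
                "_index": INDEX,
                "_id": str(p),        # full path as _id makes re-runs idempotent
                "_source": file_doc(p),
            }

if __name__ == "__main__":
    es = Elasticsearch(ES_URL)
    # Streaming bulk keeps memory flat even with millions of files.
    helpers.bulk(es, actions(ROOT), chunk_size=5000)
```

Using the path as the document _id means re-crawling the same tree only overwrites existing documents rather than duplicating them.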
I've explored FSCrawler for this task but found it doesn't scale well with large numbers of files and tends to get stuck. I'm looking for a more robust solution that can:
Monitor specific folders
Index metadata of new files as they're added
Handle a massive volume of files efficiently
Any guidance or recommendations would be greatly appreciated!
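One thing worth knowing for the "index new files as they're added" requirement: inotify-based watchers generally only see changes made locally, so on an NFS/SMB mount written to by other machines a periodic polling pass per watched folder is usually the safer pattern. A rough sketch of that idea (same placeholder index and URLs as above, poll interval chosen arbitrarily):

```python
import time
from datetime import datetime, timezone
from pathlib import Path

from elasticsearch import Elasticsearch, helpers

ES_URL = "http://localhost:9200"            # placeholder
INDEX = "file-metadata"                     # placeholder
WATCHED = ["/mnt/network-share/incoming"]   # placeholder folders to monitor
POLL_SECONDS = 300

def doc_for(p: Path) -> dict:
    st = p.stat()
    created = getattr(st, "st_birthtime", st.st_ctime)  # no true creation time on most Linux FS
    return {
        "filename": p.name,
        "path": str(p.parent),
        "extension": p.suffix.lstrip(".").lower(),
        "created": datetime.fromtimestamp(created, tz=timezone.utc).isoformat(),
    }

def poll_once(es: Elasticsearch, folder: str, seen: set) -> None:
    """Index metadata for files that appeared since the previous pass."""
    actions = []
    for p in Path(folder).rglob("*"):
        key = str(p)
        if key in seen or not p.is_file():
            continue
        seen.add(key)
        actions.append({"_index": INDEX, "_id": key, "_source": doc_for(p)})
    if actions:
        helpers.bulk(es, actions)

if __name__ == "__main__":
    es = Elasticsearch(ES_URL)
    # For very large trees, persist this set (or compare mtimes) instead of
    # keeping every path in memory.
    seen_paths: set = set()
    while True:
        for folder in WATCHED:
            poll_once(es, folder, seen_paths)
        time.sleep(POLL_SECONDS)
```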
FWIW, FSCrawler indeed does not scale well the way it's implemented today.
The workaround is to create more than one FSCrawler instance, say one per directory under the root directory.
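To make that concrete, here is a rough sketch of "one instance per top-level directory": generate a minimal job settings file for each directory and start each job separately. The job names, paths, and settings keys below follow recent FSCrawler documentation but are illustrative; verify them against the version you actually run.

```python
import subprocess
from pathlib import Path

ROOT = Path("/mnt/network-share")          # placeholder mount point
FSCRAWLER_HOME = Path.home() / ".fscrawler"
ES_URL = "http://localhost:9200"           # placeholder

# Minimal per-job settings; fs.index_content: false skips content extraction
# and keeps only the file metadata. Check the key names for your version.
SETTINGS_TEMPLATE = """\
name: "{job}"
fs:
  url: "{path}"
  update_rate: "15m"
  index_content: false
elasticsearch:
  nodes:
    - url: "{es_url}"
"""

def create_jobs():
    procs = []
    for top_dir in sorted(p for p in ROOT.iterdir() if p.is_dir()):
        job = f"files-{top_dir.name}"
        job_dir = FSCRAWLER_HOME / job
        job_dir.mkdir(parents=True, exist_ok=True)
        (job_dir / "_settings.yaml").write_text(
            SETTINGS_TEMPLATE.format(job=job, path=top_dir, es_url=ES_URL)
        )
        # One crawler process per top-level directory (assumes `fscrawler` is on PATH).
        procs.append(subprocess.Popen(["fscrawler", job]))
    return procs

if __name__ == "__main__":
    for proc in create_jobs():
        proc.wait()
```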
I've been experimenting with FSCrawler and found it can index ~18 million files in 17 hours. To optimize this, I'm considering a hybrid approach:
Pre-indexing static files: Use FSCrawler with includes/excludes regex to index known, unchanging content.
Scripted FSCrawler for dynamic folders: Since I know when new folders with data will be added (and that these periods are short, e.g., 5 weeks), I'd like to use a script to generate FSCrawler settings and trigger it automatically before data is added.
Question: Do you think this hybrid approach is a reasonable solution? Are there potential pitfalls or better alternatives I should consider?
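For the "trigger it automatically" part of the dynamic-folder idea, one possible shape is a small script run from cron (or a systemd timer) shortly before and during the known data-loading window: create the job if it doesn't exist yet, then run a single crawl pass. This is only a sketch; the paths and names are placeholders, and the --loop 1 option (crawl once and exit) should be confirmed against your FSCrawler version.

```python
import subprocess
from pathlib import Path

FSCRAWLER_HOME = Path.home() / ".fscrawler"
DYNAMIC_DIR = Path("/mnt/network-share/weekly-drops")  # placeholder parent of dynamic folders
ES_URL = "http://localhost:9200"                       # placeholder

SETTINGS_TEMPLATE = """\
name: "{job}"
fs:
  url: "{path}"
  index_content: false
elasticsearch:
  nodes:
    - url: "{es_url}"
"""

def ensure_job(folder: Path) -> str:
    """Create an FSCrawler job for a dynamic folder if it doesn't exist yet."""
    job = f"dynamic-{folder.name}"
    job_dir = FSCRAWLER_HOME / job
    if not job_dir.exists():
        job_dir.mkdir(parents=True)
        (job_dir / "_settings.yaml").write_text(
            SETTINGS_TEMPLATE.format(job=job, path=folder, es_url=ES_URL)
        )
    return job

if __name__ == "__main__":
    # Intended to be called from cron during the known ~5-week window, e.g.:
    #   0 * * * * /usr/bin/python3 /opt/indexing/crawl_dynamic.py
    for folder in sorted(p for p in DYNAMIC_DIR.iterdir() if p.is_dir()):
        job = ensure_job(folder)
        # --loop 1: run a single crawl and exit, so cron controls the schedule
        # (verify this flag is available in the FSCrawler release you use).
        subprocess.run(["fscrawler", job, "--loop", "1"], check=False)
```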