How to index millions of files' content with a scalable approach?

I am working on an application that requires indexing the content of thousands of files to make it searchable. I use Tika to extract the file content and then index it into Elasticsearch. I also tried an ingest pipeline to index the file content, but it wasn't scalable for me; sometimes timeouts or the machine's heap size cause problems.
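For context, the extract-then-index flow described above can be sketched roughly like this. This is a minimal illustration, not the actual application code: the index name, paths, and the `extract` callable are assumptions, and the commented wiring uses tika-python and the official `elasticsearch` client's streaming bulk helper.

```python
from pathlib import Path

def file_actions(paths, index, extract):
    """Yield one Elasticsearch bulk action per file.

    `extract` is any callable returning a file's text (e.g. tika-python's
    parser.from_file(path)["content"]); it is injected so extraction stays
    decoupled from indexing and can be parallelised or stubbed in tests.
    """
    for path in paths:
        yield {
            "_index": index,
            "_id": str(path),  # stable id, so re-runs update instead of duplicating
            "_source": {"path": str(path), "content": extract(path)},
        }

# Hypothetical wiring (names and URLs are assumptions, not from this thread):
#
#   from tika import parser
#   from elasticsearch import Elasticsearch, helpers
#
#   es = Elasticsearch("http://localhost:9200")
#   files = (p for p in Path("/data/docs").rglob("*") if p.is_file())
#   actions = file_actions(files, "docs",
#                          lambda p: parser.from_file(str(p))["content"])
#   # streaming_bulk sends small batches and never materialises all documents
#   # in memory, which helps avoid the timeout and heap problems mentioned above.
#   for ok, item in helpers.streaming_bulk(es, actions, chunk_size=500):
#       if not ok:
#           print("failed:", item)
```

Keeping the generator lazy means memory use stays bounded by the bulk chunk size rather than the total number of files.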

To support our customers, we need a scalable solution for indexing millions of files' content without running into these problems.

Please share your suggestions for getting to the right solution.

You could give FSCrawler a try, but it's not yet multithreaded, so you would need to start one instance per subdirectory, for example.

This will change in the future.
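The one-instance-per-subdirectory workaround could be scripted along these lines. This is a hedged sketch, not tested advice from the thread: the root path, job-name prefix, and config directory are assumptions, and the function only prints the commands (a dry run) so you can inspect them before piping to `sh`.

```shell
# Print one FSCrawler launch command per subdirectory of the given root.
# Paths and job names below are illustrative assumptions.
launch_per_subdir() {
  root="$1"
  for dir in "$root"/*/; do
    job="docs_$(basename "$dir")"
    # Each instance gets its own job; "&" backgrounds it so subdirs crawl in parallel.
    echo "bin/fscrawler $job --config_dir /etc/fscrawler &"
  done
}
```

Usage: `launch_per_subdir /data/docs | sh` would start the instances; each job still needs its own FSCrawler job settings pointing at its subdirectory.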


@dadoonet, do you have any code sample or documentation focused on this problem? It would help me create a quick POC. Thanks for your prompt response; I appreciate your help and guidance.

David, when you say one instance per subdir, do you mean one instance of Elasticsearch?

No, one instance of FSCrawler per subdir.


Thanks, @Christian_Dahlqvist !

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.