How to index millions of files' content with a scalable approach?

I am working on an application that requires indexing the content of thousands of files to make it searchable. I use Tika to extract the file content and then index it into Elasticsearch. I also tried an ingest pipeline to index the file content, but it wasn't scalable for me; sometimes timeouts or the machine's heap size cause problems.
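For context, the extract-then-index flow described above can be sketched roughly like this. This is a minimal illustration, not the actual application code: the index name, paths, and the `extract` callable are assumptions, and the commented wiring uses tika-python and the official `elasticsearch` client's streaming bulk helper.

```python
from pathlib import Path

def file_actions(paths, index, extract):
    """Yield one Elasticsearch bulk action per file.

    `extract` is any callable returning a file's text (e.g. tika-python's
    parser.from_file(path)["content"]); it is injected so extraction stays
    decoupled from indexing and can be parallelised or stubbed in tests.
    """
    for path in paths:
        yield {
            "_index": index,
            "_id": str(path),  # stable id, so re-runs update instead of duplicating
            "_source": {"path": str(path), "content": extract(path)},
        }

# Hypothetical wiring (names and URLs are assumptions, not from this thread):
#
#   from tika import parser
#   from elasticsearch import Elasticsearch, helpers
#
#   es = Elasticsearch("http://localhost:9200")
#   files = (p for p in Path("/data/docs").rglob("*") if p.is_file())
#   actions = file_actions(files, "docs",
#                          lambda p: parser.from_file(str(p))["content"])
#   # streaming_bulk sends small batches and never materialises all documents
#   # in memory, which helps avoid the timeout and heap problems mentioned above.
#   for ok, item in helpers.streaming_bulk(es, actions, chunk_size=500):
#       if not ok:
#           print("failed:", item)
```

Keeping the generator lazy means memory use stays bounded by the bulk chunk size rather than the total number of files.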

To support our customers, we need a scalable solution for indexing millions of files' content without running into these problems.

Please share your suggestions for getting to the right solution.

You could give FSCrawler a try, but it's not yet multithreaded, so you would need to start one instance per subdirectory, for example.

This will change in the future.
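The one-instance-per-subdirectory workaround could be scripted along these lines. This is a hedged sketch, not tested advice from the thread: the root path, job-name prefix, and config directory are assumptions, and the function only prints the commands (a dry run) so you can inspect them before piping to `sh`.

```shell
# Print one FSCrawler launch command per subdirectory of the given root.
# Paths and job names below are illustrative assumptions.
launch_per_subdir() {
  root="$1"
  for dir in "$root"/*/; do
    job="docs_$(basename "$dir")"
    # Each instance gets its own job; "&" backgrounds it so subdirs crawl in parallel.
    echo "bin/fscrawler $job --config_dir /etc/fscrawler &"
  done
}
```

Usage: `launch_per_subdir /data/docs | sh` would start the instances; each job still needs its own FSCrawler job settings pointing at its subdirectory.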


@dadoonet, do you have any code sample or documentation focused on this problem? It would help me create a quick POC. Thanks for your prompt response; I appreciate your help and guidance.

David, when you say one instance per subdir, do you mean one instance of Elasticsearch?

No, one instance of FSCrawler per subdir.


Thanks, @Christian_Dahlqvist !

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.