I am working on an application that needs to index the content of thousands of files to make it searchable. I use Tika to extract the file content and then index it into Elasticsearch. I also tried an ingest pipeline to index the file content, but it didn't scale for me; timeouts and machine heap-size limits also caused problems.
To support our customers, we need a scalable solution that can index the content of millions of files without running into these problems.
Please share your suggestions on the right approach.
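For reference, the Tika-extract-then-index approach described above is usually made scalable by sending documents in small bulk batches rather than one request per file (large single requests are a common cause of the timeouts and heap pressure mentioned). A minimal Python sketch, assuming the `elasticsearch` client library, an index named `documents`, and field names of my own choosing:

```python
# Sketch of preparing Tika-extracted text for Elasticsearch bulk indexing.
# Index name and field names are assumptions, not from the thread.
from typing import Iterable, Tuple

def build_bulk_actions(files: Iterable[Tuple[str, str]], index: str = "documents"):
    """Turn (path, extracted_text) pairs into Elasticsearch bulk actions.

    Batching through the bulk helpers keeps each request small, so a single
    oversized request cannot exhaust the node's heap or hit a timeout.
    """
    for path, text in files:
        yield {
            "_op_type": "index",
            "_index": index,
            "_id": path,  # using the path as _id makes re-indexing idempotent
            "_source": {"path": path, "content": text},
        }

# With a running cluster you would drain this generator through the client's
# streaming bulk helper, e.g. (hypothetical endpoint and tuning values):
#
#   from elasticsearch import Elasticsearch
#   from elasticsearch.helpers import streaming_bulk
#   es = Elasticsearch("http://localhost:9200")
#   for ok, item in streaming_bulk(es, build_bulk_actions(extracted_pairs),
#                                  chunk_size=500,        # small batches
#                                  request_timeout=120):  # tolerate slow bulks
#       pass
```

Keeping extraction (Tika) on the client side and only shipping plain text to the cluster, as in this sketch, is also what avoids the heap problems you saw with the ingest pipeline, since the binary parsing no longer happens on the Elasticsearch nodes.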
You could give FSCrawler a try, but it's not multithreaded yet, so you need to start one instance per subdirectory, for example.
This will change in the future.
@dadoonet, do you have any code samples or documentation focused on this problem? That would help me build a quick POC. Thanks for your prompt response; I appreciate your help and guidance.
David, when you say one instance per subdir, do you mean one instance of Elasticsearch?
No, one instance of FSCrawler per subdirectory.
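The "one FSCrawler instance per subdirectory" workaround can be sketched as a small launcher script. This is a sketch under assumptions: the `bin/fscrawler` path and the `job_<name>` naming are mine, and each job's FSCrawler settings file would still need its `fs.url` pointed at the matching subdirectory before launching.

```python
# Sketch: launch one FSCrawler process per top-level subdirectory of a root
# folder. Binary path and job names are assumptions; each job's settings
# file must already point fs.url at its own subdirectory.
import subprocess
from pathlib import Path

def fscrawler_commands(root: str, binary: str = "bin/fscrawler"):
    """Build one FSCrawler command line per subdirectory of `root`."""
    return [
        [binary, f"job_{sub.name}"]  # one job, hence one instance, per subdir
        for sub in sorted(Path(root).iterdir())
        if sub.is_dir()
    ]

def launch_all(root: str):
    # Popen starts the crawlers concurrently, which is the point of the
    # workaround: FSCrawler itself is single-threaded, so parallelism
    # comes from running several instances side by side.
    return [subprocess.Popen(cmd) for cmd in fscrawler_commands(root)]
```

Splitting by subdirectory like this also makes it easy to restart just one crawler if it fails, since each instance owns an independent slice of the tree.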