I am working on an application that needs to index the content of thousands of files to make it searchable. I use Tika to extract the file content and then index it into Elasticsearch. I also tried using an ingest pipeline to index the file content, but it didn't scale for me. In addition, timeouts and heap size limits sometimes cause problems.
To support our customers, we need a scalable solution that can index the content of millions of files without running into these problems.
I would appreciate any suggestions on the right approach.
@dadoonet, do you have a code sample or documentation focused on this problem? It would help me put together a quick POC. Thanks for your prompt response; I appreciate your help and guidance.
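
For a quick POC, one possible starting point is to keep Tika extraction outside Elasticsearch and send the extracted text in size-bounded `_bulk` requests, so that no single request grows large enough to cause timeouts or heap pressure. The sketch below is only an illustration under stated assumptions: the function name `bulk_batches`, the index name `files`, the field names `path` and `content`, and the 5 MB / 500-document limits are all hypothetical, and the input stands in for text that Tika has already extracted. It only builds the NDJSON request bodies and does not send them.

```python
import json
from typing import Dict, Iterable, Iterator, Tuple


def bulk_batches(
    docs: Iterable[Tuple[str, Dict[str, str]]],
    index: str,
    max_bytes: int = 5 * 1024 * 1024,  # assumed limit, tune for your cluster
    max_docs: int = 500,               # assumed limit, tune for your cluster
) -> Iterator[str]:
    """Yield NDJSON bodies for the _bulk API, each bounded in size and doc count."""
    lines = []
    size = 0
    count = 0
    for doc_id, source in docs:
        # Each document is an action line followed by a source line.
        action = json.dumps({"index": {"_index": index, "_id": doc_id}})
        body = json.dumps(source)
        entry_size = len(action.encode("utf-8")) + len(body.encode("utf-8")) + 2
        # Flush the current batch before it would exceed either limit.
        if lines and (size + entry_size > max_bytes or count >= max_docs):
            yield "\n".join(lines) + "\n"
            lines, size, count = [], 0, 0
        lines.extend([action, body])
        size += entry_size
        count += 1
    if lines:
        # A bulk body must end with a newline.
        yield "\n".join(lines) + "\n"


def extracted_files() -> Iterator[Tuple[str, Dict[str, str]]]:
    # Stand-in for text already extracted by Tika (hypothetical data).
    for i in range(5):
        yield f"file-{i}", {"path": f"/data/file-{i}.pdf", "content": f"text of file {i}"}


batches = list(bulk_batches(extracted_files(), index="files", max_docs=2))
print(len(batches))  # 3
```

Each yielded string would be sent as the body of a `POST /_bulk` request with `Content-Type: application/x-ndjson`. Because the input is consumed as a generator, only one batch is held in memory at a time, and several workers could run extraction and batching in parallel over different subsets of the files.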