You can call the FSCrawler REST API and its _simulate endpoint from any application you write, then merge the result with the data coming from your database.
In other words, the FSCrawler simulate endpoint gives you the text extracted from the binary file. Once you have this content, you can update the existing JSON documents with it.
You will end up with documents looking like:
{
  "doc_id": "coming from your upload service",
  "user_id": "coming from your upload service",
  "doc_name": "coming from your upload service",
  "file_description": "coming from your upload service",
  "tags": ["coming from your upload service", "coming from your upload service"],
  "content": "This content is coming from FSCrawler simulate endpoint"
}
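
To make the merge step concrete, here is a minimal Python sketch. It makes a few assumptions: the simulate response is already parsed into a dict, and the extracted text sits under `doc.content` (check the response of your FSCrawler version, as the shape may differ). The function name `merge_extracted_content` and the sample values are hypothetical, for illustration only. The HTTP call to FSCrawler is deliberately left out; this only shows how to combine the two sources.

```python
import json


def merge_extracted_content(metadata: dict, simulate_response: dict) -> dict:
    """Return a copy of the upload-service metadata with a 'content' field added.

    Assumption: the extracted text is found under simulate_response["doc"]["content"].
    If that key is missing, the content falls back to an empty string.
    """
    content = simulate_response.get("doc", {}).get("content", "")
    merged = dict(metadata)  # shallow copy, so the caller's dict is not modified
    merged["content"] = content.strip()
    return merged


# Metadata as it would come from your upload service / database (sample values).
metadata = {
    "doc_id": "coming from your upload service",
    "user_id": "coming from your upload service",
    "doc_name": "coming from your upload service",
    "file_description": "coming from your upload service",
    "tags": ["coming from your upload service", "coming from your upload service"],
}

# Hypothetical parsed response from the FSCrawler simulate call.
simulate_response = {
    "ok": True,
    "filename": "report.pdf",
    "doc": {"content": "This content is coming from FSCrawler simulate endpoint\n"},
}

document = merge_extracted_content(metadata, simulate_response)
print(json.dumps(document, indent=2))
```

The resulting `document` is what you would then send to Elasticsearch yourself, for example as an update to the existing document identified by `doc_id`.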