ChatGPT and DeepSeek hallucinated on this, so I'm asking here:
I have tens of PDFs downloaded.
I want to search for a particular topic inside the content of the PDFs.
I run the search.
Then I want to open the matching PDF at the exact page where the content was found.
Something similar to the search engine of books.google.com. I researched a bit and found that Google uses Tesseract to convert PDFs to text. Now, my concern is how to integrate this workflow with an indexer like Lucene and a search engine like Elasticsearch. I am absolutely new to all three of these. I'd like a starting point, and a general idea of how complicated this will be. I am not looking for a step-by-step guide.
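To get a feel for what the indexer side does, here is a minimal sketch of the per-page workflow in pure Python, with no Lucene or Elasticsearch involved. The `extract_pages` function is a hypothetical stand-in for whatever actually pulls text out of the PDF (Tesseract OCR, a PDF text library, etc.); the sample data is invented for illustration:

```python
from collections import defaultdict

def extract_pages(pdf_name):
    # Placeholder: a real pipeline would OCR or parse the PDF here
    # and return one text string per page.
    sample = {
        "report.pdf": ["alpha beta", "gamma delta", "beta epsilon"],
    }
    return sample.get(pdf_name, [])

def build_index(pdf_names):
    # Inverted index: term -> list of (pdf_name, page_number).
    # Keeping the page number in the posting is what lets a search
    # result jump straight to the right page later.
    index = defaultdict(list)
    for name in pdf_names:
        for page_no, text in enumerate(extract_pages(name), start=1):
            for term in set(text.lower().split()):
                index[term].append((name, page_no))
    return index

def search(index, term):
    return index.get(term.lower(), [])

index = build_index(["report.pdf"])
print(search(index, "beta"))  # -> [('report.pdf', 1), ('report.pdf', 3)]
```

Lucene and Elasticsearch do essentially this, just with tokenization, scoring, and persistence handled for you; the part you have to design yourself is deciding that a "document" in the index is one *page*, not one *file*.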
But none of those features will tell you the exact page number where the text was found. At least, not yet with FSCrawler. That would require implementing this:
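Until that lands, a common workaround (my suggestion, not something FSCrawler does for you) is to extract the text yourself and index one Elasticsearch document per page rather than per file, so every hit naturally carries its page number. Index, document ID, and field names below are illustrative:

```json
PUT pdf-pages/_doc/report.pdf-page-42
{
  "file": "report.pdf",
  "page": 42,
  "content": "text extracted from page 42 ..."
}
```

A standard `match` query on `content` then returns the `file` and `page` fields of each hit, which is enough to open the PDF at the right page.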
And also Kibana now supports directly uploading PDF files. See this very nice blog post: