Achieving a books.google.com-style workflow with Elasticsearch as the search engine, an indexer, and the Tesseract OCR tool?

ChatGPT and DeepSeek hallucinated on this, so I am asking here:

  • Tens of PDFs are downloaded.
  • I want to search for a particular topic inside the content of the PDFs.
  • I search for it.
  • Then I want to open that PDF at the exact page where the "content" I searched for was found.

Something similar to the search engine of books.google.com. I researched a bit and found that Google uses Tesseract to convert PDFs to text. Now, my concern is how to integrate this workflow with an indexer like Lucene and a search engine like Elasticsearch. I am absolutely new to all three of these things. I'd like a starting point, and a general idea of how complicated this is going to be. I am not looking for a step-by-step guide.

You can use the ingest attachment plugin.

There is an example here: https://www.elastic.co/guide/en/elasticsearch/plugins/current/using-ingest-attachment.html

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my_index/_doc/my_id

The data field is simply the Base64 representation of your binary file.
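For reference, here is a minimal sketch in Python (using the requests library) of how you could push a real PDF through that pipeline and then search the text that the attachment processor extracts into attachment.content. The index name, document id, file name and the unauthenticated local URL are just placeholders for the example, not something dictated by the plugin:

import base64
from pathlib import Path

import requests  # plain HTTP client, talking to the same REST API shown above

ES = "http://localhost:9200"          # assumed local, unauthenticated cluster
pdf_path = Path("some_book.pdf")      # hypothetical PDF on disk

# The attachment pipeline expects the raw bytes of the file, Base64-encoded,
# in the "data" field of the document.
encoded = base64.b64encode(pdf_path.read_bytes()).decode("ascii")

# Index the PDF through the "attachment" pipeline created above.
resp = requests.put(
    f"{ES}/my_index/_doc/{pdf_path.stem}",
    params={"pipeline": "attachment"},
    json={"data": encoded},
    timeout=60,
)
resp.raise_for_status()

# The processor stores the extracted text in attachment.content,
# so an ordinary match query searches inside the PDF's text.
query = {"query": {"match": {"attachment.content": "your topic"}}}
print(requests.post(f"{ES}/my_index/_search", json=query, timeout=30).json())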

You can use FSCrawler. There's a tutorial to help you get started.

But none of those options will tell you the exact page number where the text was found. At least, not yet with FSCrawler; that would require a new feature to be implemented there.
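Until that exists, one workaround is to do the OCR yourself and index one document per page, so the page number comes back as an ordinary field on each hit. A rough sketch of that idea in Python, assuming pdf2image (which needs Poppler), pytesseract (which needs the Tesseract binary) and an unsecured local Elasticsearch; the index and field names are made up for the example:

from pathlib import Path

import requests                            # plain HTTP client for the REST API
import pytesseract                         # wrapper around the Tesseract OCR binary
from pdf2image import convert_from_path    # renders PDF pages to images (needs Poppler)

ES = "http://localhost:9200"               # assumed local, unauthenticated cluster
pdf_path = Path("some_book.pdf")           # hypothetical PDF on disk

# OCR each page separately and index it as its own document,
# so the page number is just another field on the hit.
for page_number, image in enumerate(convert_from_path(str(pdf_path)), start=1):
    text = pytesseract.image_to_string(image)
    doc = {"file": pdf_path.name, "page": page_number, "content": text}
    requests.put(
        f"{ES}/pages/_doc/{pdf_path.stem}-{page_number}",
        json=doc,
        timeout=60,
    ).raise_for_status()

# A search now tells you which file and which page matched.
query = {
    "query": {"match": {"content": "your topic"}},
    "_source": ["file", "page"],
}
print(requests.post(f"{ES}/pages/_search", json=query, timeout=30).json())

Whether you keep one document per page or one per PDF is a design choice; per-page documents keep the query simple and map directly onto the "open the PDF at that page" step you described.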

Also, Kibana now supports uploading PDF files directly. There is a very nice blog post about it.