Achieving a books.google.com-style workflow with Elasticsearch as the search engine, an indexer, and the Tesseract OCR tool?

ChatGPT and DeepSeek hallucinated on this, so I am asking here:

  • Tens of PDFs are downloaded.
  • I want to search for a particular topic inside the content of the PDFs.
  • I search for it.
  • Then I want to open that PDF at the exact page where the "content" I searched for was found.

Something similar to the search engine of books.google.com. I researched a bit and found that Google uses Tesseract to convert PDFs to text. Now, my concern is how to integrate this workflow with an indexer like Lucene and a search engine like Elasticsearch. I am absolutely new to all three of these things. I'd like a starting point, and a general idea of how complicated this is going to be. I am not looking for a step-by-step guide.

You can use the ingest attachment plugin.

There is an example here: https://www.elastic.co/guide/en/elasticsearch/plugins/current/using-ingest-attachment.html

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my_index/_doc/my_id

The data field is simply the Base64 representation of your binary file.
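For reference, here is a minimal sketch in Python (using the requests library) of how you could push a real PDF through that pipeline and then search the text that the attachment processor extracts into attachment.content. The index name, document id, file name and the unauthenticated local URL are just placeholders for the example, not something dictated by the plugin:

import base64
from pathlib import Path

import requests  # plain HTTP client, talking to the same REST API shown above

ES = "http://localhost:9200"          # assumed local, unauthenticated cluster
pdf_path = Path("some_book.pdf")      # hypothetical PDF on disk

# The attachment pipeline expects the raw bytes of the file, Base64-encoded,
# in the "data" field of the document.
encoded = base64.b64encode(pdf_path.read_bytes()).decode("ascii")

# Index the PDF through the "attachment" pipeline created above.
resp = requests.put(
    f"{ES}/my_index/_doc/{pdf_path.stem}",
    params={"pipeline": "attachment"},
    json={"data": encoded},
    timeout=60,
)
resp.raise_for_status()

# The processor stores the extracted text in attachment.content,
# so an ordinary match query searches inside the PDF's text.
query = {"query": {"match": {"attachment.content": "your topic"}}}
print(requests.post(f"{ES}/my_index/_search", json=query, timeout=30).json())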

You can use FSCrawler. There's a tutorial to help you get started.

But none of those options will tell you the exact page number where the text was found. At least, not yet with FSCrawler; that would require a new feature to be implemented there.
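Until that exists, one workaround is to do the OCR yourself and index one document per page, so the page number comes back as an ordinary field on each hit. A rough sketch of that idea in Python, assuming pdf2image (which needs Poppler), pytesseract (which needs the Tesseract binary) and an unsecured local Elasticsearch; the index and field names are made up for the example:

from pathlib import Path

import requests                            # plain HTTP client for the REST API
import pytesseract                         # wrapper around the Tesseract OCR binary
from pdf2image import convert_from_path    # renders PDF pages to images (needs Poppler)

ES = "http://localhost:9200"               # assumed local, unauthenticated cluster
pdf_path = Path("some_book.pdf")           # hypothetical PDF on disk

# OCR each page separately and index it as its own document,
# so the page number is just another field on the hit.
for page_number, image in enumerate(convert_from_path(str(pdf_path)), start=1):
    text = pytesseract.image_to_string(image)
    doc = {"file": pdf_path.name, "page": page_number, "content": text}
    requests.put(
        f"{ES}/pages/_doc/{pdf_path.stem}-{page_number}",
        json=doc,
        timeout=60,
    ).raise_for_status()

# A search now tells you which file and which page matched.
query = {
    "query": {"match": {"content": "your topic"}},
    "_source": ["file", "page"],
}
print(requests.post(f"{ES}/pages/_search", json=query, timeout=30).json())

Whether you keep one document per page or one per PDF is a design choice; per-page documents keep the query simple and map directly onto the "open the PDF at that page" step you described.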

Also, Kibana now supports uploading PDF files directly. There is a very nice blog post about it.