Indexing PDF file in ElasticSearch using Java Code

dadoonet · July 31, 2018, 8:59pm

Storing BASE64 content in Elasticsearch this way might be correct, but a few pieces are missing here:

You are not "indexing" the PDF per se in Elasticsearch. If you want to do so, you need to define an ingest pipeline and use the ingest attachment plugin to extract the content from the PDF.
You did not mention the mapping you are using. If you "really" want to keep the binary content around, you might want to define the BASE64 field with the binary data type.
It does not sound like a good idea to me to use Elasticsearch to store large blobs like this.
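
To make the first point more concrete, here is a minimal sketch of the request bodies involved, using only the JDK (Java 11+ for the HTTP client). The field name "data", the pipeline id "attachment", the index name "docs" and the localhost:9200 address are all assumptions made for illustration, not something taken from the original question. It also assumes the ingest attachment plugin is installed on the cluster.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

class AttachmentPipelineSketch {

    // Body for PUT _ingest/pipeline/<id>: one attachment processor reading
    // the BASE64 content from the given field (field name is an assumption).
    static String pipelineJson(String field) {
        return "{\"description\":\"Extract text from a BASE64 field\","
             + "\"processors\":[{\"attachment\":{\"field\":\"" + field + "\"}}]}";
    }

    // Body for the index request: the raw PDF bytes, BASE64-encoded.
    // BASE64 output never contains quotes or backslashes, so no JSON escaping is needed.
    static String documentJson(String field, byte[] pdfBytes) {
        String b64 = Base64.getEncoder().encodeToString(pdfBytes);
        return "{\"" + field + "\":\"" + b64 + "\"}";
    }

    // Sends a JSON body with PUT and returns the response body as text.
    static String put(HttpClient client, String url, String json) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(json))
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.out.println("usage: AttachmentPipelineSketch <file.pdf>");
            return;
        }
        byte[] pdf = Files.readAllBytes(Path.of(args[0]));
        HttpClient client = HttpClient.newHttpClient();
        String base = "http://localhost:9200"; // assumed local, unsecured cluster

        // 1. Define the pipeline once.
        System.out.println(put(client, base + "/_ingest/pipeline/attachment", pipelineJson("data")));
        // 2. Index the document through the pipeline so the text gets extracted.
        System.out.println(put(client, base + "/docs/_doc/1?pipeline=attachment", documentJson("data", pdf)));
    }
}
```

By default the attachment processor writes the extracted text and metadata under an "attachment" field of the indexed document, next to the original BASE64 field.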

Instead, I'd extract the text and metadata and index that plus a URL to the binary itself. Like:

{
  "content": "Extracted text here",
  "meta": {
    // Metadata here
  },
  "url": "file://path/to/file"
}

You can also look at FSCrawler (including its code), which does basically that.
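
As a rough sketch of the "extract and index a URL" approach in Java, the class below only builds the JSON document shown above. It assumes the text and metadata have already been extracted by some other tool (a library such as Apache Tika could do that, but it is not used here), and the class and method names are hypothetical. In real code a JSON library would normally replace the hand-written escaping.

```java
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

class ExtractedDocumentSketch {

    // Minimal JSON string escaping: quotes, backslashes and control characters.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Builds {"content": ..., "meta": {...}, "url": ...} from already-extracted values.
    static String toJson(String content, Map<String, String> meta, String url) {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"content\":\"").append(escape(content)).append("\",\"meta\":{");
        boolean first = true;
        for (Map.Entry<String, String> e : meta.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
            first = false;
        }
        sb.append("},\"url\":\"").append(escape(url)).append("\"}");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Placeholder values; a real program would fill these from the extractor.
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("title", "Example report");
        String url = Path.of("/path/to/file.pdf").toUri().toString();
        System.out.println(toJson("Extracted text here", meta, url));
    }
}
```

The binary itself stays on disk (or in whatever store the URL points to), so Elasticsearch only holds the searchable text and a pointer to the file.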
