Indexing a PDF file in Elasticsearch using Java code

I am trying to index PDF files in Elasticsearch using Java code. So far I have written the following code to save a PDF in ES. The code is working fine and I am able to save the Base64-encoded string of my PDF in ES. I want to understand whether the approach I am following is correct. Is there a better way of doing it?
Following is my code:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Base64;

    import org.apache.commons.io.IOUtils;
    import org.apache.http.HttpEntity;
    import org.apache.http.entity.ContentType;
    import org.apache.http.nio.entity.NStringEntity;
    import org.json.JSONObject;

    try (InputStream inputStream = new FileInputStream(new File("mypdf.pdf"))) {
        // Read the whole PDF and encode it once as a Base64 string
        byte[] fileByteStream = IOUtils.toByteArray(inputStream);
        String strEncoded = Base64.getEncoder().encodeToString(fileByteStream);

        // Wrap the encoded content in a JSON document
        JSONObject correspondenceNode = new JSONObject();
        correspondenceNode.put("data", strEncoded);

        // Index it as document 1 of index "2018", type "documents"
        HttpEntity entity = new NStringEntity(correspondenceNode.toString(), ContentType.APPLICATION_JSON);
        elasticrestClient.put("/2018/documents/1", entity);
    } catch (IOException e) {
        e.printStackTrace();
    }

Basically, what I am doing here is converting the PDF document into a Base64 string and saving it in ES; when reading it back, I decode it again.

Following is the code for decoding:

    String responseBody = elasticrestClient.get("/2018/documents/1");
    // some code to fetch the hits
    JSONObject h = hitsArray.getJSONObject(0);
    JSONObject source = h.getJSONObject("_source");
    String data = source.getString("data");

    // Decode the Base64 string back into the original PDF bytes
    byte[] decoded = Base64.getDecoder().decode(data);
    try (FileOutputStream fos = new FileOutputStream("download.pdf")) {
        fos.write(decoded);
    }

This might be correct for storing Base64 content in Elasticsearch, but a few pieces might be missing here:

  1. You are not really "indexing" the PDF in Elasticsearch. If you want to do so, you need to define an ingest pipeline and use the ingest attachment plugin to extract the content from the PDF (a sketch follows this list).
  2. You did not mention the mapping you are using. If you really want to keep the binary content around, you might want to define the Base64 field with the binary data type (also sketched below).
  3. It does not sound like a good idea to me to use Elasticsearch to store large blobs like this.
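
For item 1, here is a minimal sketch, assuming the ingest-attachment plugin is installed (`bin/elasticsearch-plugin install ingest-attachment`) and reusing the `elasticrestClient` wrapper from the question (its `put(path, entity)` signature is an assumption based on the snippet above):

    import org.json.JSONArray;

    // Create an ingest pipeline named "attachment" that runs the attachment
    // processor over the Base64 content of the "data" field.
    JSONObject pipeline = new JSONObject()
            .put("description", "Extract text and metadata from PDFs")
            .put("processors", new JSONArray()
                    .put(new JSONObject()
                            .put("attachment", new JSONObject().put("field", "data"))));
    elasticrestClient.put("/_ingest/pipeline/attachment",
            new NStringEntity(pipeline.toString(), ContentType.APPLICATION_JSON));

    // Index the document through that pipeline so the PDF text gets extracted:
    elasticrestClient.put("/2018/documents/1?pipeline=attachment", entity);

And for item 2, if you really want to keep the binary content, the `data` field can be mapped as `binary` when the index is created (pre-7.x mapping-type syntax here, to match the `/2018/documents/...` URLs above):

    // A binary field is stored but not searchable; create the index
    // with this mapping before indexing any documents.
    String mapping = "{\"mappings\":{\"documents\":{"
            + "\"properties\":{\"data\":{\"type\":\"binary\"}}}}}";
    elasticrestClient.put("/2018", new NStringEntity(mapping, ContentType.APPLICATION_JSON));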

Instead, I'd extract the text and metadata and index that plus a URL to the binary itself, like:

    {
      "content": "Extracted text here",
      "meta": {
        // Meta data there
      },
      "url": "file://path/to/file"
    }
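
For example, here is a minimal sketch of that extraction using Apache Tika (my choice for illustration; it is also the library FSCrawler builds on), again reusing the `elasticrestClient` wrapper from the question:

    import org.apache.tika.Tika;
    import org.apache.tika.metadata.Metadata;

    // Extract the text and metadata locally with Tika, then index only
    // that plus a pointer to the original binary.
    Tika tika = new Tika();
    Metadata metadata = new Metadata();
    try (InputStream is = new FileInputStream("mypdf.pdf")) {
        String content = tika.parseToString(is, metadata);

        // Copy the Tika metadata entries into a JSON object
        JSONObject meta = new JSONObject();
        for (String name : metadata.names()) {
            meta.put(name, metadata.get(name));
        }

        JSONObject doc = new JSONObject()
                .put("content", content)
                .put("meta", meta)
                .put("url", "file://path/to/mypdf.pdf");

        elasticrestClient.put("/2018/documents/1",
                new NStringEntity(doc.toString(), ContentType.APPLICATION_JSON));
    } catch (Exception e) {
        e.printStackTrace();
    }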

You can also look at FSCrawler (including its code) which does basically that.
