I am trying to index PDF files in Elasticsearch using Java code. So far I have written the following code to save the PDF in ES. The code works, and I am able to save the Base64-encoded string of my PDF in ES. Is the approach I am following correct, or is there a better way of doing it?
The following is my code:
InputStream inputStream = new FileInputStream(new File("mypdf.pdf"));
try {
    byte[] fileByteStream = IOUtils.toByteArray(inputStream);
    // First Base64 pass over the raw PDF bytes
    String base64String = Base64.getEncoder().encodeToString(fileByteStream);
    // Second Base64 pass over the already encoded string
    String strEncoded = Base64.getEncoder().encodeToString(base64String.getBytes("UTF-8"));
    inputStream.close();
    // Wrap the encoded string in a JSON document under the "data" field
    JSONObject correspondenceNode = new JSONObject();
    correspondenceNode.put("data", strEncoded);
    String strJsonValues = correspondenceNode.toString();
    HttpEntity entity = new NStringEntity(strJsonValues, ContentType.APPLICATION_JSON);
    elasticrestClient.put("/2018/documents/1", entity);
} catch (IOException e) {
    e.printStackTrace();
}
In short, I convert the PDF document into a Base64 string, save it in ES, and convert it back when reading.
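To make that round trip concrete, here is a minimal, self-contained sketch that uses only `java.util.Base64` and no Elasticsearch client. The class name `Base64RoundTrip`, its `encode` and `decode` helpers, and the six sample bytes are hypothetical and only stand in for the real PDF. It compares one Base64 pass with two passes, assuming the stored field only needs to be JSON-safe text.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class Base64RoundTrip {

    // One Base64 pass turns arbitrary bytes into ASCII text that is safe inside JSON.
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // One decode pass restores the bytes that were encoded.
    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // Stand-in for PDF content: "%PDF" followed by two non-text bytes.
        byte[] raw = {0x25, 0x50, 0x44, 0x46, (byte) 0xFF, 0x00};

        // Single pass, and a second pass over the already encoded text.
        String once = encode(raw);
        String twice = encode(once.getBytes(StandardCharsets.UTF_8));

        System.out.println("single pass restores bytes: "
                + Arrays.equals(raw, decode(once)));
        System.out.println("double pass restores bytes: "
                + Arrays.equals(raw, decode(new String(decode(twice), StandardCharsets.UTF_8))));
        System.out.println("lengths (raw, once, twice): "
                + raw.length + ", " + once.length() + ", " + twice.length());
    }
}
```

Both variants should restore the original bytes. Each Base64 pass grows the data by roughly one third, so the printed lengths give a rough idea of what a second pass costs in storage.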
The following is the code for decoding:
String responseBody = elasticrestClient.get("/2018/documents/1");
// some code to fetch the hits
JSONObject h = hitsArray.getJSONObject(0);
source = h.getJSONObject("_source");
String object = source.getString("data");
// Undo the outer Base64 pass; the result is still Base64 text
byte[] decodedStr = Base64.getDecoder().decode(object);
// Undo the inner Base64 pass to get the raw PDF bytes;
// try-with-resources closes the file even if the write fails
try (FileOutputStream fos = new FileOutputStream("download.pdf")) {
    fos.write(Base64.getDecoder().decode(new String(decodedStr, "UTF-8")));
}
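One way to check that the file written by the decoding step really equals the original is a byte-for-byte comparison. The sketch below is only an illustration under assumptions: the class `RoundTripCheck` and its helper `sameContent` are hypothetical names, the two temporary files stand in for mypdf.pdf and download.pdf, and the Elasticsearch round trip is replaced by an in-memory encode and decode.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Base64;

public class RoundTripCheck {

    // True when both files hold exactly the same bytes.
    static boolean sameContent(Path a, Path b) throws IOException {
        return Arrays.equals(Files.readAllBytes(a), Files.readAllBytes(b));
    }

    public static void main(String[] args) throws IOException {
        // Temporary stand-ins for mypdf.pdf and download.pdf.
        Path original = Files.createTempFile("original", ".pdf");
        Path download = Files.createTempFile("download", ".pdf");
        Files.write(original, new byte[] {0x25, 0x50, 0x44, 0x46, (byte) 0xFF, 0x00});

        // Encode as the document would be stored, then decode as it would be read back.
        String stored = Base64.getEncoder().encodeToString(Files.readAllBytes(original));
        Files.write(download, Base64.getDecoder().decode(stored));

        System.out.println("files match: " + sameContent(original, download));
    }
}
```

Reading both files fully into memory is fine for a quick check on small documents; for very large PDFs a streamed comparison would be the safer choice.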