How to index and store a PDF file in Elasticsearch using Spring Boot?

Yes, I have gone through that documentation. In it I came across the creation of a pipeline using PUT _ingest/pipeline/attachment. But whatever we are creating or inserting is done via Kibana or Postman. I would like to create the pipeline via my Java code.

I also have one more doubt. As you mentioned, the ingest plugin is used to extract the info from the PDF, but my main concern here is to store the files into Elasticsearch.

The documentation is here: Create or update pipeline API | Java REST Client [7.17] | Elastic

Well, I'm always hesitant to do that. Elasticsearch is not ideal for storing big binary blobs, so I'd say it can be OK if the size of your documents is limited to, say, 10kb.
If you are planning to store megabytes of documents, that's another story.

It comes with a lot of costs, like the network bandwidth when you are going to fetch the 10 top relevant results: 1mb * 10 equals 10mb for each search response...
You need to think about it...

I prefer storing the binary in another datastore or on a file system, and indexing only the extracted content and the metadata in Elasticsearch.
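As a minimal sketch of that "store elsewhere, index only content and metadata" approach: build a document holding the extracted text plus a pointer to where the real file lives. The field names, the `path` field, and the idea of getting `extractedText` from a text-extraction library such as Apache Tika are all illustrative assumptions, not something prescribed by this thread.

```java
import java.util.HashMap;
import java.util.Map;

public class PdfMetadataDoc {
    // Build a JSON-ready document: searchable text plus metadata,
    // with only a reference ("path") to the binary stored elsewhere.
    public static Map<String, Object> buildDoc(String fileName, String storagePath,
                                               long sizeBytes, String extractedText) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("filename", fileName);
        doc.put("path", storagePath);      // where the actual binary lives (disk, S3, ...)
        doc.put("size_bytes", sizeBytes);
        doc.put("content", extractedText); // full-text searchable field
        return doc;
    }
    // Indexing it with the high-level REST client would then be, e.g.:
    // client.index(new IndexRequest("files").source(buildDoc(...)), RequestOptions.DEFAULT);
}
```

This keeps each Elasticsearch document small while search still works against the extracted content.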


Okay, thank you so much for your time.

Our file size would be 130kb max. So can we store them by using the ingest plugin?

Sounds possible to me.

But note that there is no relationship between "storing" the binary in Elasticsearch and indexing its content in Elasticsearch.

You can do:

  • index content (with the ingest attachment plugin)
  • store the file
  • index content (with the plugin) and store the content

So it depends on the use case I'd say.

I don't want to index the content. I would like to index the file. Can we store the file directly into ES?

What does mean "index the file"?

Indexing and storing are different things.

Giving some index to the file so that it can be fetched by using that index, and later storing it. If it's possible, can we store the file directly without any indexing? I just want to store the file into ES.

Yes you can. Just use the binary data type.

Where do I need to mention it as binary?
For example, I am creating a pipeline by using the code below. Where exactly should I make it binary?

String source =
    "{\"description\":\"my set of processors\"," +
    "\"processors\":[{\"set\":{\"field\":\"data\",\"value\":\"encodedfile\"}}]}";
PutPipelineRequest request = new PutPipelineRequest(
    "ourpipeline",
    new BytesArray(source.getBytes(StandardCharsets.UTF_8)),
    XContentType.JSON
);
AcknowledgedResponse response = client().ingest().putPipeline(request, RequestOptions.DEFAULT);

In the mapping. See https://www.elastic.co/guide/en/elasticsearch/reference/current/binary.html

If you just want to store the binary file as BASE64, you don't need a pipeline. So you don't need the ingest attachment plugin.

Okay, but all those PUT requests are performed via Postman or Kibana to store the data. What if we want to use the above approach of the binary data type through Java code?

And also, I have gone through the binary data type. It is clearly mentioned that it cannot be searched or stored by default. So what if I want to fetch my file which I stored earlier? What should I do at that time?


It is clearly mentioned that it cannot be searched or stored by default.

It cannot be searched as it is not indexed. You told me that you don't want to index it but just store it. So we are good.

The field is not stored by default. That is a good default, as the binary is also stored within the _source field.

PUT my_index
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "blob": {
        "type": "binary"
      }
    }
  }
}

PUT my_index/_doc/1
{
  "name": "Some binary blob",
  "blob": "U29tZSBiaW5hcnkgYmxvYg==" 
}
GET my_index/_doc/1

This will give you back something like:

{
  // ...
  "_source": {
    "name": "Some binary blob",
    "blob": "U29tZSBiaW5hcnkgYmxvYg==" 
  }
}
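To get the stored file back later (the question asked above), you fetch the document and Base64-decode the blob field from _source. A minimal sketch; the GET call itself needs a running cluster, so it is only indicated in comments, and the index and field names are the illustrative ones from the example above:

```java
import java.util.Base64;

public class BlobDecoder {
    // Decode the Base64 string stored in the "blob" field back to raw bytes.
    public static byte[] decode(String base64Blob) {
        return Base64.getDecoder().decode(base64Blob);
    }
    // With the high-level REST client (as used elsewhere in this thread),
    // fetching and restoring the file would look something like:
    // GetResponse r = client.get(new GetRequest("my_index", "_doc", "1"), RequestOptions.DEFAULT);
    // String blob = (String) r.getSourceAsMap().get("blob");
    // Files.write(Paths.get("restored.pdf"), decode(blob));
}
```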

What if we want to use the above approach of binary data type through java code?

You need to create the mapping accordingly. See Update mapping API | Java REST Client [7.17] | Elastic or Create Index API | Java REST Client [7.17] | Elastic
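As a hedged illustration (assuming the 7.x high-level REST client used elsewhere in this thread, with illustrative index and field names), creating the index together with a binary field mapping could look like:

```java
public class BinaryMappingSketch {
    // JSON mapping declaring an (illustrative) "blob" field as binary
    // and a "name" field as text.
    public static String mappingJson() {
        return "{ \"properties\": { "
             + "\"blob\": { \"type\": \"binary\" }, "
             + "\"name\": { \"type\": \"text\" } } }";
    }

    // With a RestHighLevelClient this would be applied as, e.g.:
    // CreateIndexRequest request = new CreateIndexRequest("my_index");
    // request.mapping(mappingJson(), XContentType.JSON);
    // client.indices().create(request, RequestOptions.DEFAULT);
}
```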

Then you need to use the Index API as you already did:

@PostMapping("/upload")
public String upload() {
    String filePath = "C://x.pdf";
    String encodedfile = null;
    File file = new File(filePath);
    // Read the file and Base64-encode it (try-with-resources closes the stream)
    try (FileInputStream fileInputStreamReader = new FileInputStream(file)) {
        byte[] bytes = new byte[(int) file.length()];
        fileInputStreamReader.read(bytes);
        encodedfile = Base64.getEncoder().encodeToString(bytes);
    } catch (IOException e) {
        e.printStackTrace();
    }

    RestHighLevelClient restHighLevelClient =
        new RestHighLevelClient(RestClient.builder(new HttpHost("localhost", 9200, "http")));

    Map<String, Object> jsonMap = new HashMap<>();
    jsonMap.put("Name", "samanvi");
    jsonMap.put("postDate", new Date());
    jsonMap.put("hra", encodedfile);
    IndexRequest request = new IndexRequest("index", "_doc", "56")
        .source(jsonMap);
    try {
        IndexResponse response = restHighLevelClient.index(request, RequestOptions.DEFAULT);
    } catch (ElasticsearchException | IOException e) {
        e.printStackTrace();
    }
    return "uploaded";
}

Thanks for your time. I'll give it a try and get back.

Hi David,
I have tried implementing the above approach. The index is successfully getting created, but I am unable to pass the encoded string via put mapping.


@GetMapping("/final")
public String trail() throws IOException
{
	String filePath="C:\\Users\\xyz.pdf";
	String encodedfile = null;
	
	File file = new File(filePath);
	try {
	    FileInputStream fileInputStreamReader = new FileInputStream(file);
	    byte[] bytes = new byte[(int) file.length()];
	    fileInputStreamReader.read(bytes);
	    encodedfile = new String(Base64.getEncoder().encodeToString(bytes));
	} catch (IOException e) {
	    e.printStackTrace();
	}
	
	IndexRequest request1 = new IndexRequest("twitter","_doc","56"); 

	request1.source(
	        "{\n" +
	        "  \"properties\": {\n" +
	        "    \"message\": {\n" +
	        "      \"type\": \"binary\"\n" +
	        "    }\n" +
	        "  }\n" +
	        "}", 
	        XContentType.JSON);
	IndexResponse createIndexResponse = client().index(request1, RequestOptions.DEFAULT);
	System.out.print(createIndexResponse);
	
 
    
     PutMappingRequest request3 = new PutMappingRequest("twitter");
     request3.source("{\n" + " \"message\": encodedfile " + "}",
     XContentType.JSON);
     AcknowledgedResponse putMappingResponse1 = client().indices().putMapping(request3, RequestOptions.DEFAULT);

	return "done"; 
	
}

I am getting the below error:

2020-03-13 15:21:02.627 ERROR 1660 --- [nio-8085-exec-2] o.a.c.c.C.[.[.[/].[dispatcherServlet]    : 
Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is ElasticsearchStatusException[Elasticsearch exception [type=parse_exception, reason=Failed to parse content to map]]; nested: ElasticsearchException[Elasticsearch exception [type=json_parse_exception, reason=Unrecognized token 'encodedfile': was expecting ('true', 'false' or 'null')
 at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@21a5b7d2; line: 2, column: 25]]];] with root cause

Can you please help?

You are mixing the index document request and the put mapping request.

This should be in the put mapping request:

        "{\n" +
        "  \"properties\": {\n" +
        "    \"message\": {\n" +
        "      \"type\": \"binary\"\n" +
        "    }\n" +
        "  }\n" +
        "}", 

And this should be the index request:

"{\n" + " \"message\": encodedfile " + "}"

Note that you first create the index, then the mapping, and only then can you index documents.

Thank you so much. It is working after changing as per your instructions.
Here is what I did:

@GetMapping("/final")
public String trail() throws IOException
{
    String filePath = "C:\\Users\\w8ben.pdf";
    String encodedfile = null;

    File file = new File(filePath);
    // Read the file and Base64-encode it (try-with-resources closes the stream)
    try (FileInputStream fileInputStreamReader = new FileInputStream(file)) {
        byte[] bytes = new byte[(int) file.length()];
        fileInputStreamReader.read(bytes);
        encodedfile = Base64.getEncoder().encodeToString(bytes);
    } catch (IOException e) {
        e.printStackTrace();
    }

    System.out.println(encodedfile);

    // creating the index
    CreateIndexRequest request = new CreateIndexRequest("twitter");
    CreateIndexResponse createIndexResponse = client().indices().create(request, RequestOptions.DEFAULT);
    System.out.print(createIndexResponse);

    // mapping properties: declare "message" as a binary field
    PutMappingRequest request2 = new PutMappingRequest("twitter");
    request2.source(
            "{\n" +
            "  \"properties\": {\n" +
            "    \"message\": {\n" +
            "      \"type\": \"binary\"\n" +
            "    }\n" +
            "  }\n" +
            "}",
            XContentType.JSON);
    AcknowledgedResponse putMappingResponse = client().indices().putMapping(request2, RequestOptions.DEFAULT);

    // indexing the document with the Base64-encoded file
    IndexRequest request3 = new IndexRequest("twitter", "_doc", "56");
    request3.source("{\n" + "  \"message\": \"" + encodedfile + "\"\n" + "}", XContentType.JSON);
    IndexResponse indexResponse = client().index(request3, RequestOptions.DEFAULT);

    return "done";
}