How to index and store a PDF file in Elasticsearch using Spring Boot?

Yes, I have gone through that documentation. In it I came across the creation of a pipeline using PUT _ingest/pipeline/attachment. But whatever we are creating or inserting is done via Kibana or Postman. I would like to create the pipeline via my Java code.

I also have one more doubt. As you mentioned, the ingest plugin is used to extract the info from the PDF, but my main concern here is to store the files into Elasticsearch.

The documentation is here: Create or update pipeline API | Java REST Client [7.17] | Elastic

Well, I'm always hesitant to do that. Elasticsearch is not ideal for storing big binary blobs, so I'd say it can be OK if the size of your documents is limited to, say, 10kb.
If you are planning to store megabytes of documents, that's another story.

It comes with a lot of costs, like the network bandwidth when you are going to fetch the 10 top relevant results: 1mb * 10 equals 10mb for each search response...
You need to think about it...

I prefer storing the binary in another datastore or on a file system, and indexing only the extracted content and the metadata in Elasticsearch.
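As a minimal sketch of that "store elsewhere, index only content and metadata" approach: build a document holding the extracted text plus a pointer to where the real file lives. The field names, the `path` field, and the idea of getting `extractedText` from a text-extraction library such as Apache Tika are all illustrative assumptions, not something prescribed by this thread.

```java
import java.util.HashMap;
import java.util.Map;

public class PdfMetadataDoc {
    // Build a JSON-ready document: searchable text plus metadata,
    // with only a reference ("path") to the binary stored elsewhere.
    public static Map<String, Object> buildDoc(String fileName, String storagePath,
                                               long sizeBytes, String extractedText) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("filename", fileName);
        doc.put("path", storagePath);      // where the actual binary lives (disk, S3, ...)
        doc.put("size_bytes", sizeBytes);
        doc.put("content", extractedText); // full-text searchable field
        return doc;
    }
    // Indexing it with the high-level REST client would then be, e.g.:
    // client.index(new IndexRequest("files").source(buildDoc(...)), RequestOptions.DEFAULT);
}
```

This keeps each Elasticsearch document small while search still works against the extracted content.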


Okay, thank you so much for your time.

Our file size would be 130kb max. So can we store them by using the ingest plugin?

Sounds possible to me.

But note that there is no relationship between "storing" the binary in Elasticsearch and indexing its content in Elasticsearch.

You can do:

  • index content (with the ingest attachment plugin)
  • store the file
  • index content (with the plugin) and store the content

So it depends on the use case I'd say.

I don't want to index the content. I would like to index the file. Can we store the file directly into ES?

What does mean "index the file"?

Indexing and storing are different things.

Giving some index to the file so that it can be fetched by using that index, and later storing it. If it's possible, can we store the file directly without any indexing? I just want to store the file into ES.

Yes you can. Just use the binary data type.

Where do I need to mention it as binary?
For example, I am creating a pipeline by using the code below. Where exactly should I make it binary?

String source =
    "{\"description\":\"my set of processors\"," +
    "\"processors\":[{\"set\":{\"field\":\"data\",\"value\":\"encodedfile\"}}]}";
PutPipelineRequest request = new PutPipelineRequest(
    "ourpipeline",
    new BytesArray(source.getBytes(StandardCharsets.UTF_8)),
    XContentType.JSON
);
AcknowledgedResponse response = client().ingest().putPipeline(request, RequestOptions.DEFAULT);

In the mapping. See https://www.elastic.co/guide/en/elasticsearch/reference/current/binary.html

If you just want to store the binary file as BASE64, you don't need a pipeline. So you don't need the ingest attachment plugin.

Okay, but all those PUT requests are performed via Postman or Kibana to store the data. What if we want to use the above approach of the binary data type through Java code?

And also, I have gone through the binary data type. It is clearly mentioned that it cannot be searched or stored by default. So what if I want to fetch my file which I stored earlier? What should I do at that time?


It is clearly mentioned that it cannot be searched or stored by default.

It cannot be searched as it is not indexed. You told me that you don't want to index it but just store it. So we are good.

The field is not stored by default. That is a good default, as the binary is also stored within the _source field.

PUT my_index
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "blob": {
        "type": "binary"
      }
    }
  }
}

PUT my_index/_doc/1
{
  "name": "Some binary blob",
  "blob": "U29tZSBiaW5hcnkgYmxvYg==" 
}
GET my_index/_doc/1

This will give you back something like:

{
  // ...
  "_source": {
    "name": "Some binary blob",
    "blob": "U29tZSBiaW5hcnkgYmxvYg==" 
  }
}
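To get the stored file back later (the question asked above), you fetch the document and Base64-decode the blob field from _source. A minimal sketch; the GET call itself needs a running cluster, so it is only indicated in comments, and the index and field names are the illustrative ones from the example above:

```java
import java.util.Base64;

public class BlobDecoder {
    // Decode the Base64 string stored in the "blob" field back to raw bytes.
    public static byte[] decode(String base64Blob) {
        return Base64.getDecoder().decode(base64Blob);
    }
    // With the high-level REST client (as used elsewhere in this thread),
    // fetching and restoring the file would look something like:
    // GetResponse r = client.get(new GetRequest("my_index", "_doc", "1"), RequestOptions.DEFAULT);
    // String blob = (String) r.getSourceAsMap().get("blob");
    // Files.write(Paths.get("restored.pdf"), decode(blob));
}
```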

What if we want to use the above approach of binary data type through java code?

You need to create the mapping accordingly. See Update mapping API | Java REST Client [7.17] | Elastic or Create Index API | Java REST Client [7.17] | Elastic
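As a hedged illustration (assuming the 7.x high-level REST client used elsewhere in this thread, with illustrative index and field names), creating the index together with a binary field mapping could look like:

```java
public class BinaryMappingSketch {
    // JSON mapping declaring an (illustrative) "blob" field as binary
    // and a "name" field as text.
    public static String mappingJson() {
        return "{ \"properties\": { "
             + "\"blob\": { \"type\": \"binary\" }, "
             + "\"name\": { \"type\": \"text\" } } }";
    }

    // With a RestHighLevelClient this would be applied as, e.g.:
    // CreateIndexRequest request = new CreateIndexRequest("my_index");
    // request.mapping(mappingJson(), XContentType.JSON);
    // client.indices().create(request, RequestOptions.DEFAULT);
}
```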

Then you need to use the Index API as you already did:

@PostMapping("/upload")
public String upload() {
    String filePath = "C://x.pdf";
    String encodedfile = null;
    File file = new File(filePath);
    // Read the file and Base64-encode it (try-with-resources closes the stream)
    try (FileInputStream fileInputStreamReader = new FileInputStream(file)) {
        byte[] bytes = new byte[(int) file.length()];
        fileInputStreamReader.read(bytes);
        encodedfile = Base64.getEncoder().encodeToString(bytes);
    } catch (IOException e) {
        e.printStackTrace();
    }

    RestHighLevelClient restHighLevelClient =
        new RestHighLevelClient(RestClient.builder(new HttpHost("localhost", 9200, "http")));

    Map<String, Object> jsonMap = new HashMap<>();
    jsonMap.put("Name", "samanvi");
    jsonMap.put("postDate", new Date());
    jsonMap.put("hra", encodedfile);
    IndexRequest request = new IndexRequest("index", "_doc", "56")
        .source(jsonMap);
    try {
        IndexResponse response = restHighLevelClient.index(request, RequestOptions.DEFAULT);
    } catch (ElasticsearchException | IOException e) {
        e.printStackTrace();
    }
    return "uploaded";
}

Thanks for your time. I'll give it a try and get back.

Hi David,
I have tried implementing the above approach. The index is successfully getting created, but I am unable to pass the encoded string via put mapping.


@GetMapping("/final")
public String trail() throws IOException
{
	String filePath="C:\\Users\\xyz.pdf";
	String encodedfile = null;
	
	File file = new File(filePath);
	try {
	    FileInputStream fileInputStreamReader = new FileInputStream(file);
	    byte[] bytes = new byte[(int) file.length()];
	    fileInputStreamReader.read(bytes);
	    encodedfile = new String(Base64.getEncoder().encodeToString(bytes));
	} catch (IOException e) {
	    e.printStackTrace();
	}
	
	IndexRequest request1 = new IndexRequest("twitter","_doc","56"); 

	request1.source(
	        "{\n" +
	        "  \"properties\": {\n" +
	        "    \"message\": {\n" +
	        "      \"type\": \"binary\"\n" +
	        "    }\n" +
	        "  }\n" +
	        "}", 
	        XContentType.JSON);
	IndexResponse createIndexResponse = client().index(request1, RequestOptions.DEFAULT);
	System.out.print(createIndexResponse);
	
 
    
     PutMappingRequest request3 = new PutMappingRequest("twitter");
     request3.source("{\n" + " \"message\": encodedfile " + "}",
     XContentType.JSON);
     AcknowledgedResponse putMappingResponse1 = client().indices().putMapping(request3, RequestOptions.DEFAULT);

	return "done"; 
	
}

I am getting the below error:

2020-03-13 15:21:02.627 ERROR 1660 --- [nio-8085-exec-2] o.a.c.c.C.[.[.[/].[dispatcherServlet]    : 
Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is ElasticsearchStatusException[Elasticsearch exception [type=parse_exception, reason=Failed to parse content to map]]; nested: ElasticsearchException[Elasticsearch exception [type=json_parse_exception, reason=Unrecognized token 'encodedfile': was expecting ('true', 'false' or 'null')
 at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@21a5b7d2; line: 2, column: 25]]];] with root cause

Can you please help?

You are mixing the index document request and the put mapping request.

This should be in the put mapping request:

        "{\n" +
        "  \"properties\": {\n" +
        "    \"message\": {\n" +
        "      \"type\": \"binary\"\n" +
        "    }\n" +
        "  }\n" +
        "}", 

And this should be the index request:

"{\n" + " \"message\": encodedfile " + "}"

Note that you first create the index, then the mapping, and only then can you index documents.

Thank you so much. It is working after changing as per your instructions.
Here is what I did:

@GetMapping("/final")
public String trail() throws IOException
{
    String filePath = "C:\\Users\\w8ben.pdf";
    String encodedfile = null;

    File file = new File(filePath);
    // Read the file and Base64-encode it (try-with-resources closes the stream)
    try (FileInputStream fileInputStreamReader = new FileInputStream(file)) {
        byte[] bytes = new byte[(int) file.length()];
        fileInputStreamReader.read(bytes);
        encodedfile = Base64.getEncoder().encodeToString(bytes);
    } catch (IOException e) {
        e.printStackTrace();
    }

    System.out.println(encodedfile);

    // creating the index
    CreateIndexRequest request = new CreateIndexRequest("twitter");
    CreateIndexResponse createIndexResponse = client().indices().create(request, RequestOptions.DEFAULT);
    System.out.print(createIndexResponse);

    // mapping properties: declare "message" as a binary field
    PutMappingRequest request2 = new PutMappingRequest("twitter");
    request2.source(
            "{\n" +
            "  \"properties\": {\n" +
            "    \"message\": {\n" +
            "      \"type\": \"binary\"\n" +
            "    }\n" +
            "  }\n" +
            "}",
            XContentType.JSON);
    AcknowledgedResponse putMappingResponse = client().indices().putMapping(request2, RequestOptions.DEFAULT);

    // indexing the document with the Base64-encoded file
    IndexRequest request3 = new IndexRequest("twitter", "_doc", "56");
    request3.source("{\n" + "  \"message\": \"" + encodedfile + "\"\n" + "}", XContentType.JSON);
    IndexResponse indexResponse = client().index(request3, RequestOptions.DEFAULT);

    return "done";
}