Yes, I have gone through that documentation. In it I came across creating the pipeline using PUT _ingest/pipeline/attachment. But all of that creating and inserting is done via Kibana or Postman, whereas I would like to create the pipeline from my Java code.
I also have one more doubt. As you mentioned, the ingest plugin is used to extract the info from a PDF, but my main concern here is to store the files themselves in Elasticsearch.
The documentation is here: Create or update pipeline API | Java REST Client [7.17] | Elastic
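For reference, the pipeline body the Java client sends is the same JSON you would PUT from Kibana or Postman; for the attachment processor it looks like this (per the ingest attachment plugin documentation):

```
PUT _ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "data"
      }
    }
  ]
}
```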
Well, I'm always hesitant to do that. Elasticsearch is not ideal for storing big binary blobs. So I'd say it can be OK if the size of your documents is limited to, say, 10kb.
If you are planning to store megabytes of documents, that's another story.
It comes with a lot of costs, like the network bandwidth when you fetch the top 10 relevant results: 1mb * 10 equals 10mb for each search response...
You need to think about it...
I prefer storing the binary in another datastore or on a file system and just index in Elasticsearch the extracted content and the metadata.
Okay, thank you so much for your time.
Our file size would be max 130kb. So can we store them using the ingest plugin?
Sounds possible to me.
But note that there is no relationship between "storing" the binary in Elasticsearch and indexing its content in Elasticsearch.
You can do:
- index content (with the ingest attachment plugin)
- store the file
- index content (with the plugin) and store the content
So it depends on the use case I'd say.
I don't want to index the content. I would like to index the file. Can we store the file directly into ES?
What does "index the file" mean?
Indexing and storing are different things.
Giving some index to the file so that it can be fetched by using that index, and then storing it. If it's possible, can we store the file directly without any indexing? I just want to store the file in ES.
Yes you can. Just use the binary data type.
Where do I need to mention it as binary?
For example, I am creating the pipeline using the code below. Where exactly should I mark it as binary?
String source =
"{\"description\":\"my set of processors\"," +
"\"processors\":[{\"set\":{\"field\":\"data\",\"value\":\"encodedfile\"}}]}";
PutPipelineRequest request = new PutPipelineRequest(
"ourpipeline",
new BytesArray(source.getBytes(StandardCharsets.UTF_8)),
XContentType.JSON
);
AcknowledgedResponse response = client().ingest().putPipeline(request, RequestOptions.DEFAULT);
In the mapping. See https://www.elastic.co/guide/en/elasticsearch/reference/current/binary.html
If you just want to store the binary file as BASE64, you don't need a pipeline. So you don't need the ingest attachment plugin.
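For the Base64 step itself you only need the JDK, no plugin at all. A minimal sketch (the temp file here stands in for your real PDF):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

public class EncodeFile {

    // Read a file and Base64-encode it; the resulting string is what
    // goes into the binary field of the document you index.
    static String encodeFile(Path path) throws java.io.IOException {
        byte[] bytes = Files.readAllBytes(path);
        return Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) throws Exception {
        // Demo with a temp file standing in for the real PDF
        Path tmp = Files.createTempFile("demo", ".bin");
        Files.write(tmp, "Some binary blob".getBytes());
        System.out.println(encodeFile(tmp)); // prints U29tZSBiaW5hcnkgYmxvYg==
    }
}
```

Using `Files.readAllBytes` avoids the classic `InputStream.read(byte[])` pitfall of a partial read and leaves no stream to close.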
Okay, but all those PUT requests are performed via Postman or Kibana to store the data. What if we want to use the above binary data type approach through Java code?
Also, I have gone through the binary data type documentation. It is clearly mentioned that it cannot be searched or stored by default. So what if I want to fetch a file which I stored earlier? What should I do then?
It is clearly mentioned that it cannot be searched or stored by default.
It cannot be searched as it is not indexed. You told me that you don't want to index it but just store it. So we are good.
The field is not stored by default, and that is a good default, as the binary content is also stored within the _source field.
PUT my_index
{
"mappings": {
"properties": {
"name": {
"type": "text"
},
"blob": {
"type": "binary"
}
}
}
}
PUT my_index/_doc/1
{
"name": "Some binary blob",
"blob": "U29tZSBiaW5hcnkgYmxvYg=="
}
GET my_index/_doc/1
This will give you back something like:
{
// ...
"_source": {
"name": "Some binary blob",
"blob": "U29tZSBiaW5hcnkgYmxvYg=="
}
}
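Since the blob comes back in _source as the same Base64 string, getting the original bytes back is just a decode on the client side; a sketch using the value from the example above (the output path is a placeholder):

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

public class DecodeBlob {
    public static void main(String[] args) throws Exception {
        // The "blob" value taken from the GET response's _source
        String blob = "U29tZSBiaW5hcnkgYmxvYg==";
        byte[] original = Base64.getDecoder().decode(blob);

        // Write the decoded bytes back out to recreate the file
        Path out = Files.createTempFile("restored", ".bin");
        Files.write(out, original);

        System.out.println(new String(original, StandardCharsets.UTF_8)); // prints Some binary blob
    }
}
```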
What if we want to use the above binary data type approach through Java code?
You need to create the mapping accordingly. See Update mapping API | Java REST Client [7.17] | Elastic or Create Index API | Java REST Client [7.17] | Elastic
Then you need to use the Index API, as you already did:
@PostMapping("/upload")
public String upload() throws IOException {
    String filePath = "C://x.pdf";
    // Read the whole file and Base64-encode it
    byte[] bytes = Files.readAllBytes(Paths.get(filePath));
    String encodedfile = Base64.getEncoder().encodeToString(bytes);
    RestHighLevelClient restHighLevelClient =
            new RestHighLevelClient(RestClient.builder(new HttpHost("localhost", 9200, "http")));
    Map<String, Object> jsonMap = new HashMap<>();
    jsonMap.put("Name", "samanvi");
    jsonMap.put("postDate", new Date());
    jsonMap.put("hra", encodedfile);
    // The index name is already given in the constructor; pass the map as the source
    IndexRequest request = new IndexRequest("index", "_doc", "56")
            .source(jsonMap);
    try {
        IndexResponse response = restHighLevelClient.index(request, RequestOptions.DEFAULT);
    } catch (ElasticsearchException | IOException e) {
        e.printStackTrace();
    }
    return "uploaded";
}
Thanks for your time. I'll give it a try and get back.
Hi David,
I have tried implementing the above approach. The index is getting created successfully, but I am unable to pass the encoded string via put mapping.
@GetMapping("/final")
public String trail() throws IOException
{
String filePath="C:\\Users\\xyz.pdf";
String encodedfile = null;
File file = new File(filePath);
try {
FileInputStream fileInputStreamReader = new FileInputStream(file);
byte[] bytes = new byte[(int) file.length()];
fileInputStreamReader.read(bytes);
encodedfile = new String(Base64.getEncoder().encodeToString(bytes));
} catch (IOException e) {
e.printStackTrace();
}
IndexRequest request1 = new IndexRequest("twitter","_doc","56");
request1.source(
"{\n" +
" \"properties\": {\n" +
" \"message\": {\n" +
" \"type\": \"binary\"\n" +
" }\n" +
" }\n" +
"}",
XContentType.JSON);
IndexResponse createIndexResponse = client().index(request1, RequestOptions.DEFAULT);
System.out.print(createIndexResponse);
PutMappingRequest request3 = new PutMappingRequest("twitter");
request3.source("{\n" + " \"message\": encodedfile " + "}",
XContentType.JSON);
AcknowledgedResponse putMappingResponse1 = client().indices().putMapping(request3, RequestOptions.DEFAULT);
return "done";
}
I am getting the below error:
2020-03-13 15:21:02.627 ERROR 1660 --- [nio-8085-exec-2] o.a.c.c.C.[.[.[/].[dispatcherServlet] :
Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is ElasticsearchStatusException[Elasticsearch exception [type=parse_exception, reason=Failed to parse content to map]]; nested: ElasticsearchException[Elasticsearch exception [type=json_parse_exception, reason=Unrecognized token 'encodedfile': was expecting ('true', 'false' or 'null')
at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@21a5b7d2; line: 2, column: 25]]];] with root cause
Can you please help?
You are mixing the index document request and the put mapping request.
This should be in the put mapping request:
"{\n" +
" \"properties\": {\n" +
" \"message\": {\n" +
" \"type\": \"binary\"\n" +
" }\n" +
" }\n" +
"}",
And this should be the index request:
"{\n" + " \"message\": encodedfile " + "}"
Note that you first create the index, then the mapping, and only then can you index documents.
Thank you so much. It is working after changing it as per your instructions.
Here is what I did:
@GetMapping("/final")
public String trail() throws IOException
{
String filePath = "C:\\Users\\w8ben.pdf";
// Read the whole file and Base64-encode it
byte[] bytes = Files.readAllBytes(Paths.get(filePath));
String encodedfile = Base64.getEncoder().encodeToString(bytes);
System.out.println(encodedfile);
//creating index
CreateIndexRequest request = new CreateIndexRequest("twitter");
CreateIndexResponse createIndexResponse = client().indices().create(request, RequestOptions.DEFAULT);
System.out.print(createIndexResponse);
//mapping properties
PutMappingRequest request2 = new PutMappingRequest("twitter");
request2.source(
"{\n" +
" \"properties\": {\n" +
" \"message\": {\n" +
" \"type\": \"binary\"\n" +
" }\n" +
" }\n" +
"}",
XContentType.JSON);
AcknowledgedResponse putMappingResponse = client().indices().putMapping(request2, RequestOptions.DEFAULT);
IndexRequest request3 = new IndexRequest("twitter","_doc","56");
request3.source("{\n" + " \"message\": " +"\""+ encodedfile +"\""+ "}", XContentType.JSON);
IndexResponse createIndexResponse1= client().index(request3, RequestOptions.DEFAULT);
return "done";
}