What is the curl command to convert a PDF into Base64 format?


(Priyanka Suresh Yerunkar) #1

I want to convert my PDF to Base64.
I am using the code below, but it is giving me an error:

curl -XPOST "http://localhost:9200/test/xmlfile?pretty=1" -d '
{
"attachment" : "' base64 /path/filename | perl -pe 's/\n/\\n/g' '"
}'


(Xavier Facq) #2

Hi,


What do you want to do exactly? The code you have given seems to be a Linux shell command. You
can run it in a shell, but it is not a JSON request, is it?

bye
Xavier


(David Pilato) #4

You need to transform your file's binary content to Base64 before sending it to Elasticsearch.
This has to be done before calling Elasticsearch.

You can do that using a Linux command like base64, or by writing some code in your language of choice.

You can also have a look at FSCrawler project. It has an upload endpoint where you can directly upload your binary document to elasticsearch. See https://fscrawler.readthedocs.io/en/latest/admin/fs/rest.html#uploading-a-binary-document
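To sketch the Linux-command approach: assuming a cluster on localhost:9200 with an ingest pipeline named `attachment` already defined (the index name `my_index` is illustrative, and the demo file just stands in for a real PDF), the encoding and indexing could look like this:

```shell
# Demo file standing in for your PDF (any binary works the same way)
printf 'dummy pdf bytes' > /tmp/sample.pdf

# Encode as a single-line Base64 string
# (tr -d '\n' strips the line wrapping that base64 adds by default)
B64=$(base64 < /tmp/sample.pdf | tr -d '\n')
echo "$B64"   # → ZHVtbXkgcGRmIGJ5dGVz

# Send it through the ingest pipeline; the attachment processor decodes
# the "data" field and extracts the text. Requires a running cluster,
# so it is commented out here:
# curl -XPUT "http://localhost:9200/my_index/_doc/1?pipeline=attachment" \
#   -H 'Content-Type: application/json' \
#   -d "{\"data\": \"$B64\"}"
```

The key point is that the Base64 encoding happens in the shell, before the JSON body is built, rather than inside the JSON itself.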


(Priyanka Suresh Yerunkar) #5

Hi,

Thanks for your reply!!!! :slightly_smiling_face:
I have written JavaScript code to transform a .pdf file to Base64, and I am getting the value for the data field that needs to be passed. But can I pass more than one document to ES? Currently I am indexing only one PDF document, and I want to index more than one. How can I do that using the code below?

PUT my_index/_doc/my_id?pipeline=attachment
{
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}

Thanks,
Priyanka


(David Pilato) #6

Why do you want to index documents together and not individually? Are they related?

To answer your question, you can define multiple attachment processors within the same pipeline.
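To illustrate, such a pipeline could look like the sketch below (the pipeline name multi_attachment and the field names data1/data2 are made up for the example; each attachment processor decodes its own Base64 field into its own target_field):

```
PUT _ingest/pipeline/multi_attachment
{
  "description": "Extract text from two Base64-encoded attachments in one document",
  "processors": [
    { "attachment": { "field": "data1", "target_field": "attachment1" } },
    { "attachment": { "field": "data2", "target_field": "attachment2" } }
  ]
}
```

A document sent through this pipeline would then carry both data1 and data2 fields, and both extracted texts would end up inside that one document.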


(Priyanka Suresh Yerunkar) #7

Hi,

Thanks for your reply!!!
Yes, I want to index the documents together, because it is a business requirement.
After creating the pipeline and passing the data value to a new index, when I create an index pattern and open Discover, I can see the one PDF file that was indexed. I want more indexed PDF records under one index pattern, and I want to search through them.

Thanks,
Priyanka


(Priyanka Suresh Yerunkar) #8

Hello @dadoonet,

Thanks for your help!!!
As per your reply, I tried multiple attachment processors within the same pipeline. It indexes the documents together, but when I create an index pattern and open Discover, I get one single record even though I indexed 3 documents through one pipeline. If I index 3 documents, I should get 3 different records. Correct me if I am wrong.

Thanks,
Priyanka


(David Pilato) #9

So when you search, you won't get back individual documents but a single record containing an array of attachments? Meaning that the user will have to guess in which attachment the text has been found.

Is that what you really want?


(Priyanka Suresh Yerunkar) #10

Hello,

Yes, like a Google search: if the user searches for any word from an attachment, it should show in which document the text has been found.

Thanks,
Priyanka


(David Pilato) #11

This won't be possible if you index an array of attachments. You need to index attachments individually.


(Priyanka Suresh Yerunkar) #12

Hi @dadoonet ,

Thanks for your reply!!!!

If I index attachments individually, I have to create a new index every time. I want all the indexed attachments in one index only, so that I can see each document as a separate record and search through them.

Thanks,
Priyanka


(David Pilato) #13

No. All documents will go to the same index.


(Priyanka Suresh Yerunkar) #14

Hi @dadoonet,

Thanks for reply!!!
Could you please suggest how I can index multiple documents into the same index, given that I cannot use multiple attachment processors?

Thanks,
Priyanka


(David Pilato) #15

Like this:

PUT my_index/_doc/1?pipeline=attachment
{
  "data": "BASE64-doc1"
}
PUT my_index/_doc/2?pipeline=attachment
{
  "data": "BASE64-doc2"
}
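The same pattern can be scripted over a whole folder of PDFs, e.g. with a small shell loop. This is only a sketch: the demo files stand in for real PDFs, the index name my_index and the `attachment` pipeline are assumed to exist, and the curl call is commented out so the snippet runs without a cluster:

```shell
# Demo files standing in for real PDFs
mkdir -p /tmp/pdfs
printf 'first'  > /tmp/pdfs/a.pdf
printf 'second' > /tmp/pdfs/b.pdf

# Index each file as its own document, all in the same index
i=0
for f in /tmp/pdfs/*.pdf; do
  i=$((i + 1))
  b64=$(base64 < "$f" | tr -d '\n')
  echo "doc $i: $f"
  # Requires a running cluster; uncomment to actually index:
  # curl -XPUT "http://localhost:9200/my_index/_doc/$i?pipeline=attachment" \
  #   -H 'Content-Type: application/json' \
  #   -d "{\"data\": \"$b64\"}"
done
```

Each iteration creates one document with its own `_id`, so every PDF shows up as a separate record under the same index pattern in Discover.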

(Priyanka Suresh Yerunkar) #16

Hi @dadoonet,

Thanks for your quick help!!!!! :slight_smile:

This solves my problem.

Thanks,
Priyanka