I am using the ingest attachment plugin to index my PDF files. During ingestion, the source field in the query expects base64 content. Below are my questions:

Why do we need to pass base64 content to the plugin as the source? The plugin converts the base64 back to the actual content anyway, so couldn't we directly index the actual content instead of converting it to base64 and passing it to the ingest plugin?

What are the advantages of encoding the source to base64?
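For context, the flow being asked about looks roughly like this (the pipeline name, field name, and truncated base64 value are illustrative, not taken from a real index):

```
PUT _ingest/pipeline/attachment
{
  "processors": [
    {
      "attachment": {
        "field": "data"
      }
    }
  ]
}

PUT my-index/_doc/1?pipeline=attachment
{
  "data": "JVBERi0xLjQK..."
}
```

The value of `data` is the base64-encoded bytes of the PDF; the attachment processor decodes it and extracts the text server-side.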
PDFs are basically a big binary block of data, not text. Whenever you send arbitrary binary data to a server (e.g. Elasticsearch), you either have to encode the data in a way the server understands (Elasticsearch speaks JSON, so the data has to become something JSON-compatible) or teach the server to accept arbitrary binary data (e.g. implement multipart/form-data handling on the server side). Here base64 encoding was chosen.

The advantage is reduced complexity on the server compared with accepting arbitrary binary uploads, and the "cost" of the conversion isn't terribly high: the upload is 4/3 the size of the raw data, and encoders/decoders for base64 exist in virtually every programming language.

The plugin uses Tika to actually extract the text, and it's also perfectly fine to run Tika on the files in your own process and just send the text/metadata to Elasticsearch instead of the binary data.
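The encoding step and its 4/3 size overhead are easy to verify with the standard library (the sample bytes below stand in for a real PDF read from disk):

```python
import base64

# Stand-in for the raw bytes of a PDF, e.g.
# data = open("file.pdf", "rb").read()
data = b"%PDF-1.4 ... binary payload ..." * 100

# Encode to base64 so the bytes can travel inside a JSON document.
encoded = base64.b64encode(data).decode("ascii")

# Base64 maps every 3 input bytes to 4 output characters,
# so the encoded form is about 4/3 the original size.
ratio = len(encoded) / len(data)
print(round(ratio, 2))  # prints 1.33

# Decoding returns the exact original bytes.
assert base64.b64decode(encoded) == data
```

The `encoded` string is what goes into the attachment field of the JSON body of the index request.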
Have a look at the FSCrawler project. It does exactly that. Some code here: https://github.com/dadoonet/fscrawler/tree/master/tika/src/main/java/fr/pilato/elasticsearch/crawler/fs/tika