I am using the ingest attachment plugin to index my PDF files. During ingestion, the source field in the query expects base64 content. Below are my questions:

Why do we need to pass base64 content to the plugin as the source? The plugin converts the base64 back to the actual content anyway, so couldn't we directly index the actual content instead of converting it to base64 and passing it to the ingest plugin?

What are the advantages of encoding the source to base64?
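For context, the flow being asked about looks roughly like this (the pipeline name, field name, and truncated base64 value are illustrative, not taken from a real index):

```
PUT _ingest/pipeline/attachment
{
  "processors": [
    {
      "attachment": {
        "field": "data"
      }
    }
  ]
}

PUT my-index/_doc/1?pipeline=attachment
{
  "data": "JVBERi0xLjQK..."
}
```

The value of `data` is the base64-encoded bytes of the PDF; the attachment processor decodes it and extracts the text server-side.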
PDFs are basically a big binary block of data, not text. Whenever you send arbitrary binary data to a server (e.g. Elasticsearch), you either have to encode the data in a way the server understands (Elasticsearch speaks JSON, so the data has to become something JSON-compatible) or teach the server to accept arbitrary binary data (e.g. implement multipart/form-data handling on the server side). Here base64 encoding was chosen.

The advantage is reduced complexity on the server compared with accepting arbitrary binary uploads, and the "cost" of the conversion isn't terribly high: the upload is 4/3 the size of the raw data, and encoders/decoders for base64 exist in virtually every programming language.

The plugin uses Tika to actually extract the text, and it's also perfectly fine to run Tika on the files in your own process and just send the text/metadata to Elasticsearch instead of the binary data.
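The encoding step and its 4/3 size overhead are easy to verify with the standard library (the sample bytes below stand in for a real PDF read from disk):

```python
import base64

# Stand-in for the raw bytes of a PDF, e.g.
# data = open("file.pdf", "rb").read()
data = b"%PDF-1.4 ... binary payload ..." * 100

# Encode to base64 so the bytes can travel inside a JSON document.
encoded = base64.b64encode(data).decode("ascii")

# Base64 maps every 3 input bytes to 4 output characters,
# so the encoded form is about 4/3 the original size.
ratio = len(encoded) / len(data)
print(round(ratio, 2))  # prints 1.33

# Decoding returns the exact original bytes.
assert base64.b64decode(encoded) == data
```

The `encoded` string is what goes into the attachment field of the JSON body of the index request.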
Have a look at the FSCrawler project. It does exactly that. Some code here: https://github.com/dadoonet/fscrawler/tree/master/tika/src/main/java/fr/pilato/elasticsearch/crawler/fs/tika