I am using the mapper-attachments plugin with ES 2.x. When I index an attachment (say a PDF or DOCX), the base64 encoding of the attachment is stored in the field mapped to type attachment, and the extracted content can be accessed via field.content. Because _source is enabled, the base64 encoding of the PDF is also stored in the ES document, which makes the document very large and consumes more disk space. I can avoid this by excluding the attachment field from _source with the "excludes": [] setting when defining the index.
However, the issue I now face is that if I want to update this document and add a few additional attributes, I lose the attachment that was indexed originally, since the attachment field is no longer kept in _source.
My question is: is there a way to retain just the attachment content (excluding the base64 encoding) for future updates to the ES document?
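For reference, the setup described in the question can be sketched roughly as follows. This is a minimal sketch, not the poster's actual mapping: the type name doc, the field names file and title, and the helper names build_index_body and stored_source are all assumptions made for illustration. The stored_source helper is only a simplified model of _source filtering (exact top-level names, no wildcards).

```python
import json


def build_index_body(doc_type="doc", attachment_field="file"):
    """Build an ES 2.x index body that maps one field as an attachment
    and excludes that field from the stored _source.
    The type and field names are hypothetical."""
    return {
        "mappings": {
            doc_type: {
                "_source": {"excludes": [attachment_field]},
                "properties": {
                    attachment_field: {"type": "attachment"},
                    "title": {"type": "string"},
                },
            }
        }
    }


def stored_source(document, excludes):
    """Simplified model of _source filtering: drop excluded top-level
    fields by exact name. Real excludes also accept wildcard patterns."""
    return {key: value for key, value in document.items() if key not in excludes}


index_body = build_index_body()
excludes = index_body["mappings"]["doc"]["_source"]["excludes"]

# Placeholder base64 payload standing in for a real PDF.
incoming = {"title": "Report", "file": "JVBERi0xLjQ="}
kept = stored_source(incoming, excludes)

print(json.dumps(index_body, indent=2))
print(kept)
```

A partial update rebuilds the document from the stored _source, so anything missing from kept (here, the file field) cannot be re-indexed, which matches the loss described above.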
That's why we advise moving to Elasticsearch 5.0 and using the ingest-attachment plugin instead.
This one will modify the _source document with the extracted text content of your binary files.
So updates will work out of the box.
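As a rough illustration of that suggestion, an ingest pipeline for ES 5.0 could look like the sketch below. The field name data and the helper name build_attachment_pipeline are assumptions. The remove processor is also an addition that is not mentioned in the answer: it is included here on the assumption that the base64 field should be dropped from _source once the text has been extracted.

```python
import json


def build_attachment_pipeline(source_field="data"):
    """Build an ingest pipeline body that extracts text from a
    base64-encoded field and then removes the base64 field.
    The field name is hypothetical."""
    return {
        "description": "Extract text from a base64-encoded attachment",
        "processors": [
            # Writes extracted text and metadata under the "attachment"
            # field by default (for example attachment.content).
            {"attachment": {"field": source_field}},
            # Assumption: drop the raw base64 so it is not kept in _source.
            {"remove": {"field": source_field}},
        ],
    }


pipeline_body = build_attachment_pipeline()
print(json.dumps(pipeline_body, indent=2))
```

The body would be registered with PUT _ingest/pipeline/<id> and applied by indexing documents with the pipeline=<id> request parameter.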
Just to clarify my understanding: with ingest-attachment, only the extracted content is stored, and for further updates to the document (updating other fields) there is no need to send the binary attachment again?
We are running ES 2.3 in production, and migrating to ES 5.0 requires company approvals and so on, so it will take a while.
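If the behaviour described in the answer holds, a later partial update only needs to carry the changed fields. The sketch below models that with plain dictionaries. The helper apply_partial_update, the field reviewed, and the sample values are hypothetical, and the merge is a simplified top-level version of what the Update API does with a "doc" body.

```python
def apply_partial_update(source, partial_doc):
    """Simplified model of a partial update: copy the stored _source
    and overlay the top-level fields from the partial document."""
    merged = dict(source)
    merged.update(partial_doc)
    return merged


# _source as it might look after the ingest pipeline has run:
# the extracted text is part of the stored document.
source_after_ingest = {
    "title": "Report",
    "attachment": {
        "content": "extracted text",
        "content_type": "application/pdf",
    },
}

# Body of a partial update request; the binary file is not sent again.
update_body = {"doc": {"reviewed": True}}

updated = apply_partial_update(source_after_ingest, update_body["doc"])
print(updated)
```

Because the extracted text lives in _source, it survives the merge. With the ES 2.x setup and the attachment field excluded from _source, there is nothing equivalent to carry over.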