I have several terabytes of documents (PDF, Office, etc.) stored in a system
outside of ES. I want to make them searchable with ES, but I will never
serve the original documents from ES, only from that other system.
Is it possible to send the documents to ES (e.g. via a base64-encoded field
and the attachment type mapping), have ES index them, and afterwards delete
that base64 field so that the "real content" of my documents is not stored
in ES (for cost reasons)?
Queries would then be served by ES, while the real documents are served by
that other system.
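
To make the idea concrete, here is a minimal sketch of what such a setup might look like. It is not a tested configuration: it assumes an ES 1.x-era cluster with the mapper-attachments plugin installed, and it uses the `_source` `excludes` mapping option so that the base64 field is indexed but left out of the stored `_source`. The type name `doc`, the field names `file` and `external_id`, and both helper functions are hypothetical names made up for illustration. The code only builds the JSON bodies; sending them to a cluster is left out.

```python
import base64
import json

# Hypothetical names for this sketch: the type "doc", the attachment field
# "file", and "external_id", a pointer back to the system holding the originals.


def build_mapping(attachment_field="file"):
    """Build a mapping that indexes the attachment but prunes it from _source.

    Assumes the mapper-attachments plugin provides the "attachment" type.
    """
    return {
        "doc": {
            # Excluded fields are still indexed, but not kept in _source.
            "_source": {"excludes": [attachment_field]},
            "properties": {
                "external_id": {"type": "string", "index": "not_analyzed"},
                attachment_field: {"type": "attachment"},
            },
        }
    }


def build_document(external_id, raw_bytes, attachment_field="file"):
    """Build the JSON body for one document, with the file base64-encoded."""
    return {
        "external_id": external_id,
        attachment_field: base64.b64encode(raw_bytes).decode("ascii"),
    }


if __name__ == "__main__":
    print(json.dumps(build_mapping(), indent=2))
    print(json.dumps(build_document("doc-42", b"%PDF-1.4 example bytes")))
```

A search hit would then return only `external_id`, which can be used to fetch the original from the other system. The trade-off is that a field pruned from `_source` cannot be reindexed from ES alone, so the external system would have to feed the content in again.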
Thanks a lot, that sounds exactly like what I was looking for!
Why would you suggest extracting the content myself? Is it because of the
"experimental" state of the attachment type plugin?
Even if I extracted the content myself, I wouldn't want to store it in ES
(as I'd never request it from ES). The only benefit I can think of is the
ability to reindex inside ES without my external system having to feed the
content in again.
On Thursday, February 12, 2015 at 11:02:51 PM UTC+1, David Pilato wrote: