Would Elasticsearch be suitable as a long-term storage system (in addition to being a query system) for short- to mid-term offline batch analytics using Apache Spark? We're talking petabyte-scale retention over a year, with terabytes of new incoming data fed into a Kafka cluster and routed to Logstash, then ES. I'm worried that ES's overhead would make a huge difference in storage space usage compared with alternatives like compressed data on HDFS/HBase. In other words, is there a similarly consistent, automatic management system to "archive" older data on an ES cluster?
Hello, it really depends. You should run the same kind of estimate for ES to see whether it fits your purposes: do you need to index the data, do you need doc_values, do you need replication, is the best_compression codec suitable, and so on. IMHO, for now ES is too memory-hungry for petabyte-scale solutions.
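To make that estimate concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it is a hypothetical placeholder, not a measurement: the daily ingest, the ratio of on-disk index size to raw input size, and the replica count all have to come from indexing a sample of your own data. The function and constant names are made up for illustration. The only real Elasticsearch names are the two index settings, `index.codec` and `index.number_of_replicas`.

```python
def estimate_cluster_storage_tb(daily_raw_tb, retention_days, index_ratio, replicas):
    """Estimate the total on-disk footprint of the retained data, in TB.

    index_ratio: size of one primary copy on disk divided by the raw input
                 size (an assumption here; measure it on a sample of real data,
                 since it depends on mappings, doc_values and the codec).
    replicas:    number of replica copies kept for each primary shard.
    """
    primary_tb = daily_raw_tb * retention_days * index_ratio
    # Each replica is a full extra copy of the primary data.
    return primary_tb * (1 + replicas)


# Hypothetical inputs: 3 TB of raw data per day, kept for one year.
DAILY_RAW_TB = 3.0
RETENTION_DAYS = 365

# Scenario A (assumed): index is about as large as the raw data, one replica.
default_tb = estimate_cluster_storage_tb(
    DAILY_RAW_TB, RETENTION_DAYS, index_ratio=1.0, replicas=1
)

# Scenario B (assumed): leaner mappings plus best_compression halve the index
# size, and older data is kept without replicas (no redundancy inside ES).
compact_tb = estimate_cluster_storage_tb(
    DAILY_RAW_TB, RETENTION_DAYS, index_ratio=0.5, replicas=0
)

# Index settings that scenario B assumes for older indices.
ARCHIVE_INDEX_SETTINGS = {
    "settings": {
        "index.codec": "best_compression",
        "index.number_of_replicas": 0,
    }
}

print(f"Scenario A: {default_tb:.1f} TB")
print(f"Scenario B: {compact_tb:.1f} TB")
```

The gap between the two scenarios is driven entirely by the assumed ratio and replica count, which is why those questions matter before comparing against compressed files on HDFS. Note that this only covers disk space; it says nothing about the heap and memory needed to keep that many shards open.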