ES as a long-term storage system inside an analytics architecture?

Would Elasticsearch be suitable as a long-term storage system (besides being a querying system) for short to mid-term offline batch analytics using Apache Spark? We're talking petabyte-scale retention over a year, with terabytes of new incoming data fed into a Kafka cluster and routed to Logstash, then ES. I'm worried that ES's overhead would make a huge difference in storage space usage compared with alternatives like compressed data on HDFS/HBase. In other words, is there a similarly consistent, automatic management system to "archive" older data on an ES cluster?

Thanks in advance. :innocent:

So I guess the answer is: no?

Hello, it really depends. You should make your own estimate to understand whether it fits your purposes: do you need to index the data, do you need doc_values, do you need replication, is the best_compression codec suitable, and so on. IMHO, for now ES is too memory-hungry for petabyte-scale solutions.
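
To make that kind of estimate concrete, here is a rough back-of-envelope sketch in Python. Everything in it is an assumption for illustration: the function names are made up, and the ingest rate, expansion ratio, and compression ratio are placeholders that you would replace with numbers measured on a sample of your own data (the on-disk size of an index relative to the raw input depends heavily on mappings, doc_values, and the codec).

```python
def es_storage_tb(daily_raw_tb, retention_days, expansion_ratio, replicas):
    """Rough on-disk footprint of an ES cluster, in TB.

    expansion_ratio: index size on disk divided by raw input size.
    This is an assumed value; measure it on a sample of real data.
    replicas: number of replica copies per primary shard.
    """
    primary_tb = daily_raw_tb * retention_days * expansion_ratio
    return primary_tb * (1 + replicas)


def hdfs_storage_tb(daily_raw_tb, retention_days, compression_ratio, replication):
    """Rough on-disk footprint of compressed files on HDFS, in TB.

    compression_ratio: compressed size divided by raw size (assumed value).
    replication: HDFS block replication factor.
    """
    return daily_raw_tb * retention_days * compression_ratio * replication


if __name__ == "__main__":
    # Hypothetical inputs, not measurements.
    daily_raw_tb = 2.0
    retention_days = 365

    es_tb = es_storage_tb(daily_raw_tb, retention_days,
                          expansion_ratio=1.0, replicas=1)
    hdfs_tb = hdfs_storage_tb(daily_raw_tb, retention_days,
                              compression_ratio=0.25, replication=3)

    print(f"ES:   {es_tb:.1f} TB")
    print(f"HDFS: {hdfs_tb:.1f} TB")
```

This only covers disk. It says nothing about heap or node count, which is where the memory concern above comes in, so treat it as a first filter rather than a sizing plan.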
