Hi, approximately 100 GB per day, but in this situation I will need:
30 x 100 GB = 3 TB per month
and 9 TB for a 3-month period
So we will need to buy new storage, I think.
What is your situation? Do you work with raw data, or do you try to keep only the relevant fields using a Logstash pipeline or something similar, in order to reduce the data stored?
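For example (the field names and the use of the prune filter here are just placeholders, not something from your setup), I mean something like a Logstash filter that keeps only a whitelist of fields:

```
filter {
  # Keep only the fields we actually search or aggregate on;
  # all other top-level fields are dropped before indexing.
  # The field names below are placeholders - adapt them to your events.
  prune {
    whitelist_names => [ "@timestamp", "host", "message", "status", "client_ip" ]
  }
}
```

Dropping unused fields before indexing is one of the simplest ways to cut the stored volume.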
You can opt for a Hot/Warm/Cold architecture
Let's say the following config (this includes 1 replica):
7 days hot data: 2 nodes with 2 TB SSD storage
30 days warm data: 2 nodes with 8 TB SAS storage
60 days cold data: 2 nodes with 16 TB HDD storage (you can have 0 replicas here)
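As a minimal sketch (assuming you manage retention with ILM and have tagged your nodes with a `data` attribute of `hot`, `warm` or `cold`; the policy name and rollover settings are placeholders), a lifecycle policy matching that layout could look like:

```
PUT _ilm/policy/logs-retention
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_size": "50gb" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "allocate": { "require": { "data": "warm" } }
        }
      },
      "cold": {
        "min_age": "37d",
        "actions": {
          "allocate": { "require": { "data": "cold" }, "number_of_replicas": 0 }
        }
      },
      "delete": {
        "min_age": "97d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

min_age is counted from rollover, so warm starts at 7 days, cold at 7 + 30 = 37 days, and indices are deleted after 7 + 30 + 60 = 97 days.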
You can refer to this presentation on how to size your cluster
Raw data converted to JSON will roughly double the volume => 200 GB/day of JSON
If you clean your data with Logstash and customize your mapping, the indexing process may reduce the JSON size by a factor of about 0.5, so you can say that you will index around 100 GB/day of data; replicas will then double the disk size.
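To make that concrete with the figures above: 100 GB/day raw => ~200 GB/day as JSON => ~100 GB/day once indexed with a tuned mapping => ~200 GB/day on disk with 1 replica, so a 90-day retention needs roughly 200 GB x 90 = ~18 TB of disk, plus some headroom.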