This is our current cluster setup on VMware:
3 master nodes: 8GB RAM each, plus
2 data nodes: 8GB RAM and 500GB disk space each, with each node on a different physical disk for redundancy.
We recently started collecting NetFlow data with ElastiFlow. The data volume is about 25GB per day; with one
replica, that is about 50GB of disk space per day. The data retention period will be at least five days.
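As a rough check on those numbers, the retained footprint can be sketched as below. This is a back-of-the-envelope estimate only: the function name is made up for illustration, and it assumes the on-disk index size equals the quoted daily volume, ignoring merge overhead, growth, and free-space headroom.

```python
def retained_storage_gb(daily_gb, replicas, retention_days):
    """Estimate total disk used across the cluster for the retention window.

    Assumption: every primary has `replicas` full copies, and the on-disk
    size per copy equals the quoted daily volume.
    """
    copies = 1 + replicas  # one primary plus its replicas
    return daily_gb * copies * retention_days


# Figures from the post: 25GB/day, 1 replica, 5 days of retention.
total_gb = retained_storage_gb(daily_gb=25, replicas=1, retention_days=5)
print(f"Estimated retained data: {total_gb} GB")
```

With five days of retention this comes to 250GB across the cluster; a longer retention period scales the figure linearly.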
To accommodate the increasing data flow and amount, we are considering two scaling options here:
Add two more data nodes (8GB RAM, 500GB disk) to the cluster, and make sure each node's storage is on a different physical disk (thanks to shard allocation awareness).
The cluster will be 3 master nodes (8GB RAM) plus 4 data nodes (8GB RAM, 500GB disk each).
Add 8GB RAM and 500GB disk to each of the two existing data nodes.
The cluster will be 3 master nodes (8GB RAM) plus 2 data nodes (16GB RAM, 1TB disk each).
What are the pros and cons for each option? Any hints?
If we assume that CPU resources also increase by the same amount in both cases, I would not necessarily expect much difference in performance. If anything, I would expect 2 larger nodes to potentially perform better, as there is less additional network traffic.
The main difference is probably what happens in failure scenarios. If you have 2 nodes and 1 replica, both nodes will hold all data. If a node fails, Elasticsearch has nowhere to relocate the missing shards to, and you will be running with just primary shards until the node comes back.
If you have 4 nodes, Elasticsearch can and will recover missing shards on the remaining nodes should a node fail. Exactly how these recovered shards are spread out across the nodes depends on whether shard allocation awareness is used or not. In this case you will still have replicas of at least some of your shards, which increases reliability, but Elasticsearch will also try to put all shards on just 3 disks in this scenario, which could cause problems if you are using most of your disk space.
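To make the disk-space point concrete, here is a minimal sketch of per-node usage after losing one data node. It assumes shards are spread evenly across the surviving nodes, which shard allocation awareness may prevent, and it uses a hypothetical 250GB total (25GB per day, 1 replica, 5 days of retention). The function name and figures are illustrative only.

```python
def usage_after_failure_gb(total_gb, nodes, replicas=1):
    """Approximate disk used per surviving data node after one node fails.

    Assumptions: data was evenly spread, recovery spreads evenly too, and
    a shard copy is never placed on the same node as another copy of it.
    """
    copies = 1 + replicas
    survivors = nodes - 1
    if survivors < copies:
        # Not enough nodes to hold every copy: the lost copies stay
        # unassigned, so only `survivors` of the `copies` can exist.
        retained_gb = total_gb * survivors / copies
    else:
        # Enough nodes remain, so all copies are rebuilt on the survivors.
        retained_gb = total_gb
    return retained_gb / survivors


for nodes, disk_gb in ((2, 1000), (4, 500)):
    before = 250 / nodes
    after = usage_after_failure_gb(250, nodes)
    print(f"{nodes} nodes: {before:.0f} GB -> {after:.0f} GB per node "
          f"({after / disk_gb:.0%} of a {disk_gb} GB disk)")
```

In the 2-node case the surviving node's usage does not change, but there is no redundancy left. In the 4-node case each survivor takes on roughly a third more data, so the headroom per disk matters more as retention grows.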