I was going through two good posts on sizing ES servers for production (here and here).
Do we need to account for extra free space on disk? If all of my indices occupy, say, 3 TB (with replication), can I start 3 nodes with 1 TB of disk on each? Will that work?
Or do I need to keep 50% of the space unused for internal Elasticsearch operations, i.e. 1.5 TB on each node?
As Elasticsearch writes to immutable segments during indexing, which then are merged, you will need some free disk space to account for this. Exactly how much will depend on the workload. You should also consider what would happen if you lost one node. In this case, assuming that you have a replication factor of 1, Elasticsearch will want to allocate the shards on the missing node to the other nodes in the cluster, which in your case could result in about 1.5TB of data per node. If there is not enough room, replica shards will remain unassigned, which may be perfectly acceptable for your use case.
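To make the node-loss arithmetic concrete, here is a rough sketch in Python. It assumes data is spread evenly across nodes and that all replicas get reassigned to the surviving nodes. The function names are made up for illustration, and the 0.85 ceiling is an assumption based on the default low disk watermark (85%), above which Elasticsearch stops allocating new shards to a node; adjust it if your cluster uses different watermark settings.

```python
def per_node_after_failure(total_data_tb, nodes, failed=1):
    """Data each surviving node holds once all shards are reassigned.

    total_data_tb is the full index size including replicas.
    Assumes shards are balanced evenly (an idealisation).
    """
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("no surviving nodes")
    return total_data_tb / survivors


def fits_after_failure(total_data_tb, nodes, disk_tb,
                       max_used_fraction=0.85, failed=1):
    """True if every survivor stays under the assumed usage ceiling."""
    needed = per_node_after_failure(total_data_tb, nodes, failed)
    return needed <= disk_tb * max_used_fraction


if __name__ == "__main__":
    # 3 TB of data (with replication) on 3 nodes, one node lost
    print(per_node_after_failure(3.0, 3))
    for disk_tb in (1.0, 1.5, 2.0):
        print(disk_tb, fits_after_failure(3.0, 3, disk_tb))
```

Under these assumptions, only the 2 TB option leaves room for full reallocation after losing a node; with smaller disks some replicas would simply stay unassigned until the node comes back or is replaced.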
So, suppose I have 3 x 1.5 TB nodes, each with 500 GB of free disk space for these merge operations and for node-failure considerations.
If 1 node goes down in the above cluster, 1 TB of index data needs to be redistributed to the 2 other nodes to restore replication (1 replica). The free 500 GB on each of these nodes is then filled up, and they don't have sufficient space for merge operations until a new node is added. Am I right?
So is it better to keep, say, 50% of the disk free in this case? That would make each node 2 TB.
My load is something close to 10 GB/day (without replication)
Having more disk space will also give you room to grow. As for the relocation of shards on node failure, it comes down to how long you need to handle a node being down, and whether it is acceptable to have some replica shards unassigned during that time.
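As a rough way to put a number on "room to grow", the sketch below estimates how many days of ingest fit before the cluster reaches a chosen usage ceiling. It assumes nothing is deleted, that on-disk growth equals the daily ingest volume times (1 + replicas), and it reuses 0.85 as an assumed ceiling (the default low disk watermark). The function name and parameters are hypothetical.

```python
def days_until_full(capacity_gb, used_gb, daily_primary_gb,
                    replicas=1, max_used_fraction=0.85):
    """Days of ingest before cluster-wide usage hits the assumed ceiling.

    daily_primary_gb is the daily volume before replication.
    Ignores merge overhead and any retention/deletion policy.
    """
    daily_total_gb = daily_primary_gb * (1 + replicas)
    if daily_total_gb <= 0:
        raise ValueError("daily ingest must be positive")
    headroom_gb = capacity_gb * max_used_fraction - used_gb
    if headroom_gb <= 0:
        return 0.0
    return headroom_gb / daily_total_gb


if __name__ == "__main__":
    # 3 nodes x 2 TB, 3 TB already used, 10 GB/day with 1 replica
    print(days_until_full(3 * 2000, 3000, 10, replicas=1))
```

For the numbers in this thread (3 x 2 TB nodes, 3 TB used, 10 GB/day with one replica) this comes out at roughly 105 days, which is only an estimate under the stated assumptions.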
I have one more question, about how master election works if I have 3 nodes (2 master+data and 1 dedicated master).
I have gone through this post, where you gave some details about using such a configuration (starting with 3 nodes) for small clusters. With 10 GB/day and 3 TB of index data, I believe my cluster is also a small one. I have also gone through the master-election documentation here.
Since I don't want my data nodes to do extra work, I would like my only dedicated master node to be the elected master. How can I make sure that happens during master election? Or is this the default behavior?