I was wondering what the maximum recommended disk size/number of disks is per data node. What would be the consequences of using big data nodes (24TB+) in a heavy indexing cluster?
As described in this blog post, each shard comes with some overhead in terms of heap usage. Exactly how much depends on what type of data you are storing as well as the size of the shards. You also need heap space for indexing and querying data.
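If you want to gauge that overhead on your own cluster, the `_cat` APIs give a quick view of heap pressure and shard distribution per node. A minimal sketch (assuming Elasticsearch is reachable on `localhost:9200`; adjust the host for your cluster):

```shell
# Per-node heap usage: name, current heap %, max heap, and node roles.
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,node.role'

# Shard sizes per index, to see how many shards (and how much data)
# each node is carrying.
curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,store,node'
```

Watching `heap.percent` alongside shard counts over time is a practical way to see how close a node is to its limits before adding more data to it.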
Indexing is very I/O intensive and can use a lot of heap, so in order to optimise storage across nodes it is common to implement a hot/warm architecture. This means that a subset of nodes in the cluster (the 'hot' tier), equipped with fast SSD storage and a good amount of CPU, handles all indexing of new data as well as querying of the most recently indexed data. These nodes hold relatively little data, as much of their heap is consumed by indexing and querying.
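One common way to set this up is with custom node attributes and shard allocation filtering. A sketch, assuming time-based indices matching a hypothetical `logs-*` pattern and the legacy `_template` API (Elasticsearch 6.x-style; the attribute name `data` and its values are arbitrary labels you choose):

```shell
# elasticsearch.yml on each hot node (custom attribute, name is up to you):
#   node.attr.data: hot
# and on each warm node:
#   node.attr.data: warm

# Template so newly created logs-* indices are allocated to hot nodes only:
curl -s -XPUT 'localhost:9200/_template/logs' \
  -H 'Content-Type: application/json' -d '
{
  "index_patterns": ["logs-*"],
  "settings": {
    "index.routing.allocation.require.data": "hot"
  }
}'
```

With this in place, each new daily index is created on the SSD-backed hot nodes, where all the indexing happens.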
A separate set of nodes (the 'warm' tier) holds indices that are more than a few days old. These are typically no longer indexed into, which means heap can be dedicated to querying and shard overhead. These nodes typically have large volumes of spinning disk and can hold much more data than the hot nodes. This is where 'big data nodes' are a better fit.
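Moving an index from hot to warm is then just a settings update; Elasticsearch relocates the shards to nodes carrying the matching attribute. A sketch, reusing the `data` attribute from above (the index name is a hypothetical example):

```shell
# Re-tag an index that has aged out of the hot tier; its shards will be
# relocated to nodes with node.attr.data: warm.
curl -s -XPUT 'localhost:9200/logs-2018.10.01/_settings' \
  -H 'Content-Type: application/json' -d '
{
  "index.routing.allocation.require.data": "warm"
}'
```

This step is usually automated with a nightly job or a lifecycle-management tool rather than run by hand.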
Exactly how much data you can put on a node will depend on the type of data you have, how effectively you can minimise overhead, and how much heap you need to set aside for querying.