How to determine best practices for deploying an ES cluster

Hi everyone, we're currently running ES version 7.10. At the moment we only have 2 VMs powering this deployment (master and data nodes), and we're separating our nodes by using a different port number for each virtual node.

We want to migrate and upgrade this Elastic Stack deployment to a new environment that is capable of hosting new ES nodes. What is the best practice for doing so?

Here are our cluster stats:

{
  "_nodes" : {
    "total" : 6,
    "successful" : 6,
    "failed" : 0
  },
  "cluster_name" : "bdb-elastic",
  "cluster_uuid" : "pOIsDqjgStq3SnzSmY1-Sw",
  "timestamp" : 1711955998378,
  "status" : "green",
  "indices" : {
    "count" : 400,
    "shards" : {
      "total" : 787,
      "primaries" : 400,
      "replication" : 0.9675,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 2,
          "avg" : 1.9675
        },
        "primaries" : {
          "min" : 1,
          "max" : 1,
          "avg" : 1.0
        },
        "replication" : {
          "min" : 0.0,
          "max" : 1.0,
          "avg" : 0.9675
        }
      }
    },
    "docs" : {
      "count" : 1931491718,
      "deleted" : 291189
    },
    "store" : {
      "size_in_bytes" : 1941662100888,
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size_in_bytes" : 348344,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 671542259,
      "total_count" : 602846720,
      "hit_count" : 434186454,
      "miss_count" : 168660266,
      "cache_size" : 41777,
      "cache_count" : 4582209,
      "evictions" : 4540432
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 5763,
      "memory_in_bytes" : 188357460,
      "terms_memory_in_bytes" : 88109056,
      "stored_fields_memory_in_bytes" : 3710248,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 181376,
      "points_memory_in_bytes" : 0,
      "doc_values_memory_in_bytes" : 96356780,
      "index_writer_memory_in_bytes" : 724736788,
      "version_map_memory_in_bytes" : 46568,
      "fixed_bit_set_memory_in_bytes" : 75608,
      "max_unsafe_auto_id_timestamp" : 1711948747448,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "binary",
          "count" : 15,
          "index_count" : 4
        },
        {
          "name" : "boolean",
          "count" : 1249,
          "index_count" : 217
        },
        {
          "name" : "byte",
          "count" : 118,
          "index_count" : 118
        },
        {
          "name" : "date",
          "count" : 1794,
          "index_count" : 399
        },
        {
          "name" : "date_nanos",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "date_range",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "double",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "double_range",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "flattened",
          "count" : 9,
          "index_count" : 1
        },
        {
          "name" : "float",
          "count" : 2330,
          "index_count" : 132
        },
        {
          "name" : "float_range",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "geo_point",
          "count" : 551,
          "index_count" : 297
        },
        {
          "name" : "geo_shape",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "half_float",
          "count" : 511,
          "index_count" : 256
        },
        {
          "name" : "histogram",
          "count" : 40,
          "index_count" : 40
        },
        {
          "name" : "integer",
          "count" : 30,
          "index_count" : 4
        },
        {
          "name" : "integer_range",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "ip",
          "count" : 804,
          "index_count" : 297
        },
        {
          "name" : "ip_range",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "keyword",
          "count" : 32155,
          "index_count" : 397
        },
        {
          "name" : "long",
          "count" : 15289,
          "index_count" : 260
        },
        {
          "name" : "long_range",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "nested",
          "count" : 18,
          "index_count" : 13
        },
        {
          "name" : "object",
          "count" : 19492,
          "index_count" : 398
        },
        {
          "name" : "scaled_float",
          "count" : 280,
          "index_count" : 40
        },
        {
          "name" : "shape",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "short",
          "count" : 155,
          "index_count" : 78
        },
        {
          "name" : "text",
          "count" : 6585,
          "index_count" : 262
        }
      ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [
        {
          "name" : "pattern_capture",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "analyzer_types" : [
        {
          "name" : "custom",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [
        {
          "name" : "uax_url_email",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "built_in_filters" : [
        {
          "name" : "lowercase",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "unique",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "built_in_analyzers" : [ ]
    }
  },
  "nodes" : {
    "count" : {
      "total" : 6,
      "coordinating_only" : 0,
      "data" : 2,
      "data_cold" : 2,
      "data_content" : 2,
      "data_hot" : 2,
      "data_warm" : 2,
      "ingest" : 3,
      "master" : 3,
      "ml" : 6,
      "remote_cluster_client" : 6,
      "transform" : 2,
      "voting_only" : 0
    },
    "versions" : [
      "7.10.1"
    ],
    "os" : {
      "available_processors" : 68,
      "allocated_processors" : 68,
      "names" : [
        {
          "name" : "Linux",
          "count" : 6
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "3scale",
          "count" : 6
        }
      ],
      "mem" : {
        "total_in_bytes" : 303826444288,
        "free_in_bytes" : 42680385536,
        "used_in_bytes" : 261146058752,
        "free_percent" : 14,
        "used_percent" : 86
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 31
      },
      "open_file_descriptors" : {
        "min" : 466,
        "max" : 3664,
        "avg" : 1577
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 18459921446,
      "versions" : [
        {
          "version" : "15.0.1",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "15.0.1+9",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 4
        },
        {
          "version" : "1.8.0_171",
          "vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
          "vm_version" : "25.171-b11",
          "vm_vendor" : "Oracle Corporation",
          "bundled_jdk" : true,
          "using_bundled_jdk" : false,
          "count" : 2
        }
      ],
      "mem" : {
        "heap_used_in_bytes" : 22554858720,
        "heap_max_in_bytes" : 47000453120
      },
      "threads" : 713
    },
    "fs" : {
      "total_in_bytes" : 2610194092032,
      "free_in_bytes" : 598683987968,
      "available_in_bytes" : 471537152000
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 6
      },
      "http_types" : {
        "security4" : 6
      }
    },
    "discovery_types" : {
      "zen" : 6
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "tar",
        "count" : 6
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 7,
      "processor_stats" : {
        "conditional" : {
          "count" : 32470024521,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 488502
        },
        "geoip" : {
          "count" : 32470024521,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 109489
        },
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "pipeline" : {
          "count" : 129880098084,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 888476
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "user_agent" : {
          "count" : 32470024521,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 75705
        }
      }
    }
  }
}

Here is our node info:

172.70.12.161  9 79  5 0.43 0.33 0.32 ilmr     * node-m2
172.70.12.171 36 99 36 7.50 8.73 9.75 cdhlrstw - node-d2
172.70.12.161 45 79  5 0.43 0.33 0.32 lr       - node-lb1
172.70.12.161 31 79  5 0.43 0.33 0.32 ilmr     - node-m1
172.70.12.171 50 99 36 7.50 8.73 9.75 cdhlrstw - node-d1
172.70.12.161 44 79  5 0.43 0.33 0.32 ilmr     - node-m3

Feel free to ask for more information if you need it.

Thanks!

I have a few comments on your current deployment:

  1. Based on the node configuration, it looks like you have configured the cluster for high availability, given that you have 3 dedicated master nodes. However, these are all located on the same host, which means you have no resiliency at all. To achieve high availability you need a minimum of 3 hosts/VMs, and master-eligible nodes must be distributed evenly across them. Please see the official docs for more information and details.
  2. Dedicated master nodes can typically have fewer resources than data nodes and should be left to manage the cluster rather than serve traffic. They should therefore not have the ingest role. Instead, give the ingest role to the data nodes (and/or coordinating-only nodes).
  3. You have 2 data nodes, but both are located on the same host. This probably makes the load across the VMs quite unbalanced, and it also eliminates any high availability. You need to ensure data nodes are distributed across different hosts.
  4. You seem to have a single coordinating-only node, which is also an issue from a high availability perspective.
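
To make points 1 and 3 concrete, here is a rough Python sketch (not an official tool) that groups the nodes from the output posted above by host and checks whether a majority of master-eligible nodes would survive the loss of any single host. It assumes the output uses the default `_cat/nodes` column order (ip, heap.percent, ram.percent, cpu, load_1m, load_5m, load_15m, node.role, master, name); the function names are made up for illustration.

```python
from collections import defaultdict

# _cat/nodes output as posted in the question. Assumed default columns:
# ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
CAT_NODES = """\
172.70.12.161  9 79  5 0.43 0.33 0.32 ilmr     * node-m2
172.70.12.171 36 99 36 7.50 8.73 9.75 cdhlrstw - node-d2
172.70.12.161 45 79  5 0.43 0.33 0.32 lr       - node-lb1
172.70.12.161 31 79  5 0.43 0.33 0.32 ilmr     - node-m1
172.70.12.171 50 99 36 7.50 8.73 9.75 cdhlrstw - node-d1
172.70.12.161 44 79  5 0.43 0.33 0.32 ilmr     - node-m3
"""


def hosts_by_role(cat_nodes, role_letter):
    """Map host IP -> names of the nodes on that host that carry the role letter."""
    hosts = defaultdict(list)
    for line in cat_nodes.strip().splitlines():
        fields = line.split()
        ip, roles, name = fields[0], fields[7], fields[9]
        if role_letter in roles:
            hosts[ip].append(name)
    return dict(hosts)


def survives_any_single_host_loss(hosts):
    """True if a majority of the listed nodes remains after losing any one host."""
    total = sum(len(names) for names in hosts.values())
    quorum = total // 2 + 1
    return all(total - len(names) >= quorum for names in hosts.values())


masters = hosts_by_role(CAT_NODES, "m")  # "m" = master-eligible
data = hosts_by_role(CAT_NODES, "d")     # "d" = data

print("master-eligible nodes per host:", masters)
print("data nodes per host:", data)
print("master quorum survives losing one host:", survives_any_single_host_loss(masters))
```

For the output above, all three master-eligible nodes map to 172.70.12.161 and both data nodes map to 172.70.12.171, so the quorum check fails: losing either VM takes down either the whole master quorum or all of the data.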

If you only have 3 VMs and are not expecting the cluster to grow dramatically over time I would recommend setting up one node per VM with the default configuration (all roles) instead of your current configuration. Just because Elasticsearch allows you to create nodes with different profiles does not mean this is something you should do for smaller clusters. It looks to me like you have overcomplicated your setup for very limited gains (if any at all).
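
As a rough illustration of the "one node per VM, all roles" layout, the Python sketch below renders a minimal `elasticsearch.yml` for each of three nodes. The node names and the third IP address are hypothetical placeholders, and `render_node_config` is a made-up helper. Leaving `node.roles` unset is what gives each node the default (all) roles. Note that `cluster.initial_master_nodes` is only meant for bootstrapping a brand-new cluster, so it should be left out when nodes join an existing one.

```python
def render_node_config(cluster_name, nodes, index):
    """Render a minimal elasticsearch.yml for nodes[index].

    `nodes` is a list of (node_name, host) tuples, one per VM.
    node.roles is deliberately not set, so the node keeps the default roles.
    """
    names = [name for name, _ in nodes]
    hosts = [host for _, host in nodes]
    name, host = nodes[index]
    lines = [
        f"cluster.name: {cluster_name}",
        f"node.name: {name}",
        f"network.host: {host}",
        "discovery.seed_hosts: [" + ", ".join(f'"{h}"' for h in hosts) + "]",
        # Only for the very first start of a brand-new cluster; remove afterwards.
        "cluster.initial_master_nodes: [" + ", ".join(f'"{n}"' for n in names) + "]",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical layout: one node per VM. The third address is a placeholder.
NODES = [
    ("es-node-1", "172.70.12.161"),
    ("es-node-2", "172.70.12.171"),
    ("es-node-3", "172.70.12.181"),
]

for i, (node_name, _) in enumerate(NODES):
    print(f"# --- elasticsearch.yml for {node_name} ---")
    print(render_node_config("bdb-elastic", NODES, i))
```

With three master-eligible nodes on three separate hosts, the cluster can lose any one VM and still keep a majority of two for master election.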

Hi, thanks for your reply!

If you only have 3 VMs and are not expecting the cluster to grow dramatically over time I would recommend setting up one node per VM with the default configuration (all roles)

Thanks for the insight here. We'll go ahead and try using all node roles on each of the 3 VMs.

Thanks, and sorry for the late response.