Hi, we have a simple ES cluster consisting of 3 master-eligible nodes, which are also data nodes:
node-1: master
node-2: replica with shards allocated
node-3: empty replica with no shards
Now we need to replace node-2 with node-new without incurring any downtime. With that in mind, I believe we can't simply shut it down. Should we first disable shard allocation on that node:
```
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}
```
Wait for it to complete and then it's safe to shut node-2 down? After that we'll let node-new join the cluster and move shards from node-3 to it (which is desirable in our case)? Thanks.
Elasticsearch, by design, should be managing shard allocation and replication, and it should not be putting all of a cluster's shards on a single node. I could see that maybe happening if you had very few indices and all of your indices or data stream backing indices were set to one primary and no replicas, but that's not very "real world". The situation you're describing does not allow Elasticsearch to maintain its resilience - for example, if your node-2 was to fail without warning, your data would be gone.
With a properly configured cluster you should be able to use the guidance Elastic provides at Add and remove nodes in your cluster | Elasticsearch Guide [7.14] | Elastic to add and/or remove nodes. The process described takes advantage of the automatic shard allocation and allows adding and removing nodes with no downtime.
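In practice, the removal procedure in that guide comes down to allocation filtering: tell the cluster to move everything off the node first, then shut it down once it's empty. A minimal sketch, assuming the node's name as known to the cluster is literally `node-2`:

```
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.exclude._name": "node-2"
  }
}
```

Once `GET _cat/shards` shows no shards left on node-2, it's safe to shut it down; afterwards, clear the filter by setting it back to `null`. Note this is per-node, unlike `cluster.routing.allocation.enable`, which affects the whole cluster.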
Thanks, @RLPowellJr. Quoting the guide you linked to:
Voting exclusions are only required when removing at least half of the master-eligible nodes from a cluster in a short time period. They are not required when removing master-ineligible nodes, nor are they required when removing fewer than half of the master-eligible nodes.
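For completeness, in the case where you do remove half or more of the master-eligible nodes, the voting exclusions the guide mentions are managed with a dedicated API; a sketch, again assuming the node name `node-2`:

```
# Exclude node-2 from the voting configuration before shutting it down
POST _cluster/voting_config_exclusions?node_names=node-2

# After the cluster has stabilized, clear the exclusion list
DELETE _cluster/voting_config_exclusions
```

As the quote says, this isn't needed when removing a single master-eligible node from a three-node cluster.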
It may seem that it's OK to just shut node-2 down (by sending it SIGTERM) without any preliminary steps, even though it is both a master-eligible node and a data node servicing many requests, but won't this break current client connections and otherwise cause downtime? I understand that a clean shutdown is different from a sudden server/network failure, but does ES handle it properly? In my past experience, shutting down a busy node (part of a 3-node cluster) did cause some outage.
A lot depends on how your client connections are set up. If they point to just the node-2 host, either by name or IP, then you're exactly right - there will be an outage. If they point to more than one of the Elasticsearch nodes, such as how Kibana can point to an array of Elasticsearch nodes (see the elasticsearch.hosts information at Configure Kibana | Kibana Guide [7.14] | Elastic), then the connection should go to another node in the cluster.
If your three-node cluster is otherwise healthy and able to provide resilience (that is, it's distributing primary and replica shards across the nodes and re-allocating as needed), then the loss of a node, whether by a clean shutdown or sudden failure, should be survivable. If the cluster is very heavily loaded with ingest, queries, or both, then there might be some follow-on issues with responsiveness. If that's the case, consider adding more nodes to your base configuration.
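As a sanity check before (and after) taking a node out, you can confirm the cluster is green and that shards are actually spread across the nodes:

```
# Overall cluster status (green/yellow/red, relocating shards, etc.)
GET _cluster/health

# Per-shard view: which node holds each primary and replica
GET _cat/shards?v&h=index,shard,prirep,state,node
```

If health is yellow or all replicas sit on one node, fix that before removing anything.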
Thanks. "Clients" here typically means Ruby on Rails code using the searchkick, elasticsearch, elasticsearch-api and elasticsearch-transport gems, pointing at all three nodes. Maybe our developers aren't setting up the connection properly, but to be on the safe side, can I manually move shards from node-2 to node-3, shut the now-empty node-2 down, move shards from node-3 to node-new after it joins the cluster, and then fall back to automatic management of the data in the cluster by ES? As if nothing happened.
Other than being slower, manually moving shards should be as effective as allowing Elasticsearch to move them - just make sure you find everything from node-2 that needs to be migrated to node-3.
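If you do go the manual route, individual shard moves are done with the cluster reroute API. A sketch, using a hypothetical index name `my-index` with shard 0 currently on node-2:

```
POST _cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "my-index",
        "shard": 0,
        "from_node": "node-2",
        "to_node": "node-3"
      }
    }
  ]
}
```

Be aware that unless allocation is also constrained (for example with the `cluster.routing.allocation.exclude._name` filter), the balancer may move shards right back, so manual moves usually go hand in hand with a filter on the node being drained.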