(and to re-enable it again later without restarting the process?)
So the master node will promote replica shards on another node to primaries, the node_left timer will be started, and so on?
This happens automatically whenever a node falls out of the cluster: replica shards on the other nodes get promoted to primary shards to replace the primaries on the node that fell out. When the node returns to the cluster it will only contain replica shards.
What I don't understand is why you want to "move" the primary shards away from one specific node. This is kind of pointless, since the replica shards also have to be updated whenever a primary is, and they will usually get as many search requests to process as the primary. And even if you "move" all primary shards to other nodes, they will trickle back when other nodes fall out of the cluster, which will happen, or when a new index is created in the cluster.
Actually, I'm planning a rolling upgrade, but I need a way to gracefully remove the node from the cluster first, before sending it SIGTERM. If the node is still in the cluster, SIGTERM may take longer to process (e.g. because of pending index requests), and the underlying infrastructure may then kill the process after a timeout, which is not desirable.
There's software under the hood that allows implementing a two-phase shutdown procedure. The first phase is supposed to be asynchronous: send a "graceful shutdown" request, wait until all background tasks on that node are finished (e.g. by periodically polling the node), and only then send SIGTERM (which should be handled much faster this time).
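To make the two-phase idea concrete, here is a minimal Python sketch of the flow I have in mind. Everything in it is an assumption for illustration: `request_drain`, `is_idle` and `send_sigterm` are hypothetical callables that the surrounding tooling would have to supply, and I'm not claiming Elasticsearch exposes a "graceful shutdown" request that does this.

```python
import time
from typing import Callable


def wait_until_idle(is_idle: Callable[[], bool],
                    poll_interval: float = 1.0,
                    timeout: float = 60.0) -> bool:
    """Poll is_idle() until it returns True or the timeout elapses.

    Returns True if the node reported idle in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_idle():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


def two_phase_shutdown(request_drain: Callable[[], None],
                       is_idle: Callable[[], bool],
                       send_sigterm: Callable[[], None],
                       poll_interval: float = 1.0,
                       timeout: float = 60.0) -> bool:
    """Phase 1: ask the node to drain and wait. Phase 2: send SIGTERM.

    All three callables are hypothetical hooks supplied by the caller.
    SIGTERM is sent even if the drain timed out; the return value tells
    the caller whether the node was idle when the signal was sent.
    """
    request_drain()                      # phase 1: asynchronous request
    drained = wait_until_idle(is_idle, poll_interval, timeout)
    send_sigterm()                       # phase 2: should now be fast
    return drained
```

The missing piece is what to plug in for `request_drain` and `is_idle`, which is exactly what I'm asking about.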
I've implemented a controller to do something similar; it sends a SIGTERM (kill -15) to the running Elasticsearch process and waits for it to shut down gracefully. It uses SIGKILL (kill -9) only for emergency stops or if the SIGTERM fails to stop the process within a given time limit. The controller works equally well whether the node has just replica shards or primary shards.
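Roughly, the escalation logic looks like the Python sketch below. This is not the actual controller code, just a simplified illustration: the function name `stop_gracefully` and the 30-second default are made up for the example, and it assumes a POSIX system where the controller itself started the process via `subprocess.Popen`.

```python
import subprocess


def stop_gracefully(proc: subprocess.Popen,
                    grace_seconds: float = 30.0) -> bool:
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Returns True if the process exited after SIGTERM alone,
    False if SIGKILL was needed.
    """
    proc.terminate()                     # SIGTERM (kill -15) on POSIX
    try:
        proc.wait(timeout=grace_seconds)
        return True
    except subprocess.TimeoutExpired:
        proc.kill()                      # SIGKILL (kill -9) on POSIX
        proc.wait()                      # reap the killed process
        return False
```

The return value lets the caller log how often the hard kill was actually needed.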
I have a similar thing, but I'd like to avoid killing the process whenever possible, so I need a way to ensure that SIGTERM will always be handled as fast as possible.