Stopping the entire cluster without any rebalancing

Hi

I am sure I have misunderstood something, so this question might be
"stupid", but I ask anyway.

As I understand it, when a node leaves the cluster, the rest of the
cluster starts re-establishing, on the remaining nodes, the shards
(primary and/or replica) that ran on the node that just stopped. This is
fine behaviour, but I guess that you sometimes just want to stop/start
the entire cluster without triggering any rebalancing/re-establishment
processes. But when stopping all the nodes in the cluster one at a
time, some nodes will be stopped a little before others - won't
that start rebalancing/re-establishment processes between the time some
of the nodes have been stopped and the time when they have all been
stopped, and isn't that potentially a problem?

Is there a way to stop/start the cluster without triggering any
rebalancing/re-establishment processes?

Regards, Per Steffensen

If there is a good solution for this, I would also like to know.

Thanks
Vineeth


Yes, you can send a shutdown command that will shut down the whole cluster:
http://www.elasticsearch.org/guide/reference/api/admin-cluster-nodes-shutdown.html
When you shut down the full cluster, it will make sure not to do rebalancing.
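
For reference, that full-cluster shutdown is a single REST call; against a
node listening on the default localhost:9200 (as in the reproduction steps
later in this thread) it would look like:

curl -XPOST 'http://localhost:9200/_shutdown'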



Thanks!


Won't we need the same thing for starting the cluster? Imagine that you
are starting the cluster. Some of the nodes will be up and running
"first", holding primaries or replicas of shards that are also
represented on some of the nodes that are not yet up and running. The
nodes that are "first" up and running realize that the primaries or
replicas on the nodes not up yet are missing. The nodes "first" up and
running will start re-establishing those "missing" primaries and
replicas, but they shouldn't, because we know that the nodes running
those primaries and replicas will be up and running in a few seconds.
Isn't this also a problem, and is there a way around it? E.g. when
starting nodes, to tell them that a full-cluster start is going on,
and that they therefore should not try to re-establish anything before
the full cluster has had a chance to start - e.g. not before 1 minute
or so has passed.

Regards, Per Steffensen

Yes, that's what the three gateway settings are there for:
https://github.com/elasticsearch/elasticsearch/blob/master/config/elasticsearch.yml#L241.
They control when the initial recovery of the full cluster state will begin,
and thereby when shard allocation happens.
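
For example, with the values used in the reproduction config later in this
thread, initial recovery (and the shard allocation that follows it) starts
as soon as all 3 expected nodes have joined, or otherwise after 5 minutes
once at least 1 node is present:

gateway.recover_after_nodes: 1
gateway.recover_after_time: 5m
gateway.expected_nodes: 3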



I just found the gateway settings and thought that they might be used
for that. Now you have confirmed it. Thanks a lot!


BTW, is there a way to explicitly name the cluster that you intend to
shut down when doing such a cluster-shutdown? I know you are using
and , and that the node you get through that will only be
part of one cluster, but it might be nice if you were actually able
to give the name of the cluster you intend to shut down (more explicit
than and ) in order to not shut down the wrong cluster :-)

IMO

A node is a member of only one cluster.
A node from cluster A doesn't see cluster B.

So I don't think it's possible.

But you can ask for the cluster name of your current node and check that it's the right cluster before stopping it.

David ;-)
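
A sketch of that check, assuming the cluster health API (which reports the
cluster name in its response):

curl -XGET 'http://localhost:9200/_cluster/health'
# response includes e.g. "cluster_name" : "tltsteff_es"

Only send the _shutdown request if the returned cluster_name matches the
cluster you intend to stop.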


We shut down using the method described here:
http://www.elasticsearch.org/guide/reference/api/admin-cluster-nodes-shutdown.html
It seems to work fine. But when starting the nodes again, shards might get
relocated, even though all the nodes are started very quickly after each
other. Isn't relocation of shards supposed to be avoided if you restart
all "gateway.expected_nodes" nodes within the "gateway.recover_after_time"
period? We started all nodes within half a minute and we have
"gateway.recover_after_time" set to 5m, but shards were still
relocated. Any explanation or elaboration? Why does this happen?

Thanks, Steff


Any comments on this one? It can be reproduced in a very simple way
(elasticsearch-0.18.2).
My elasticsearch.yml looks like this:

cluster.name: tltsteff_es

action.auto_create_index: false

discovery.zen.ping.multicast.enabled: true

node.data: true

gateway.recover_after_nodes: 1
gateway.recover_after_time: 5m
gateway.expected_nodes: 3

discovery.zen.minimum_master_nodes: 1

discovery.zen.ping.timeout: 3s

Step-by-step to reproduce (I'm running on Mac OS X Snow Leopard):

  1. cd <elasticsearch-0.18.2-install>
  2. Execute the following command 3 times quickly after each other, in
    order to start 3 nodes in the same cluster: ./bin/elasticsearch
  3. Start elasticsearch-head and connect to http://localhost:9200/ -
    wait until all nodes have started and joined the cluster (green state)
  4. Using elasticsearch-head create 3 indices ("index1", "index2" and
    "index3") each with 3 shards and 1 replica
  5. Observe how the shards are distributed among nodes. They are
    probably very nicely distributed (each node running one primary shard
    and one replica for each index)
  6. Stop the cluster - using: curl -XPOST 'http://localhost:9200/_shutdown'
  7. Execute the following command 3 times quickly after each other, in
    order to start 3 nodes in the same cluster: ./bin/elasticsearch
  8. Observe how the shards are distributed among nodes - the
    distribution has probably changed (see the note after this list for a
    command-line way to check it)
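
A note on the two "observe" steps, in case elasticsearch-head is not at
hand: the cluster state API should also show the allocation - the routing
table in its response lists which node each primary and replica lives on:

curl -XGET 'http://localhost:9200/_cluster/state?pretty=true'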

I do not understand why shards are relocated among the 3 nodes just
because the cluster consisting of 3 local nodes is shut down and 3
new local nodes are started. I think each of them is just supposed to
take over the shard allocation of one of the 3 nodes that ran before the
shutdown, and not do any relocation of shards.

It does not seem like it is any gateway stuff causing this moving of
shards around - at least, running with gateway DEBUG logging, I see the
following line in the log: delaying initial state recovery for [5m]
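
For anyone reproducing this: assuming the stock config/logging.yml layout,
that gateway DEBUG output can be enabled by adding an entry under the
logger section of that file:

logger:
  gateway: DEBUG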

But I also see the following in the log:

[2011-11-02 13:13:42,638][INFO ][cluster.service          ]
[Hurricane] new_master [Hurricane][98CZj6QVSy2aiAEr1m_lxg][inet[/
192.168.1.107:9300]], reason: zen-disco-join (elected_as_master)
[2011-11-02 13:13:42,647][INFO ][discovery                ]
[Hurricane] tltsteff_es/98CZj6QVSy2aiAEr1m_lxg
[2011-11-02 13:13:42,653][DEBUG][gateway.local            ]
[Hurricane] [find_latest_state]: loading metadata from [/Applications/
elasticsearch-0.18.2/data/tltsteff_es/nodes/0/_state/metadata-4]
[2011-11-02 13:13:42,655][DEBUG][gateway.local            ]
[Hurricane] [find_latest_state]: loading started shards from [/
Applications/elasticsearch-0.18.2/data/tltsteff_es/nodes/0/_state/
shards-20]
[2011-11-02 13:13:42,656][DEBUG][gateway                  ]
[Hurricane] delaying initial state recovery for [5m]
[2011-11-02 13:13:42,660][INFO ][http                     ]
[Hurricane] bound_address {inet[/0.0.0.0:9200]}, publish_address
{inet[/192.168.1.107:9200]}
[2011-11-02 13:13:42,660][INFO ][node                     ]
[Hurricane] {0.18.2}[1462]: started
[2011-11-02 13:13:44,465][INFO ][cluster.service          ]
[Hurricane] added {[Cap 'N Hawk][W32DAJKFRV2zMPqw3vnuxA][inet[/
192.168.1.107:9302]],}, reason: zen-disco-receive(join from node[[Cap
'N Hawk][W32DAJKFRV2zMPqw3vnuxA][inet[/192.168.1.107:9302]]])
[2011-11-02 13:13:44,492][INFO ][cluster.service          ] [Cap 'N
Hawk] detected_master [Hurricane][98CZj6QVSy2aiAEr1m_lxg][inet[/
192.168.1.107:9300]], added {[Hurricane][98CZj6QVSy2aiAEr1m_lxg][inet[/
192.168.1.107:9300]],}, reason: zen-disco-receive(from master
[[Hurricane][98CZj6QVSy2aiAEr1m_lxg][inet[/192.168.1.107:9300]]])
[2011-11-02 13:13:44,506][INFO ][discovery                ] [Cap 'N
Hawk] tltsteff_es/W32DAJKFRV2zMPqw3vnuxA
[2011-11-02 13:13:44,511][DEBUG][gateway.local            ] [Cap 'N
Hawk] [find_latest_state]: loading metadata from [/Applications/
elasticsearch-0.18.2/data/tltsteff_es/nodes/2/_state/metadata-4]
[2011-11-02 13:13:44,513][DEBUG][gateway.local            ] [Cap 'N
Hawk] [find_latest_state]: loading started shards from [/Applications/
elasticsearch-0.18.2/data/tltsteff_es/nodes/2/_state/shards-20]
[2011-11-02 13:13:44,531][INFO ][http                     ] [Cap 'N
Hawk] bound_address {inet[/0.0.0.0:9201]}, publish_address {inet[/
192.168.1.107:9201]}
[2011-11-02 13:13:44,533][INFO ][node                     ] [Cap 'N
Hawk] {0.18.2}[1490]: started
[2011-11-02 13:13:46,581][INFO ][cluster.service          ]
[Hurricane] added {[Shamrock][DRBdotspSPmv6HUHgpmuog][inet[/
192.168.1.107:9301]],}, reason: zen-disco-receive(join from
node[[Shamrock][DRBdotspSPmv6HUHgpmuog][inet[/192.168.1.107:9301]]])
[2011-11-02 13:13:46,590][INFO ][cluster.service          ] [Cap 'N
Hawk] added {[Shamrock][DRBdotspSPmv6HUHgpmuog][inet[/
192.168.1.107:9301]],}, reason: zen-disco-receive(from master
[[Hurricane][98CZj6QVSy2aiAEr1m_lxg][inet[/192.168.1.107:9300]]])
[2011-11-02 13:13:46,602][DEBUG][gateway.local            ] [Shamrock]
[find_latest_state]: loading metadata from [/Applications/
elasticsearch-0.18.2/data/tltsteff_es/nodes/1/_state/metadata-4]
[2011-11-02 13:13:46,604][DEBUG][gateway.local            ] [Shamrock]
[find_latest_state]: loading started shards from [/Applications/
elasticsearch-0.18.2/data/tltsteff_es/nodes/1/_state/shards-20]
[2011-11-02 13:13:46,606][INFO ][cluster.service          ] [Shamrock]
detected_master [Hurricane][98CZj6QVSy2aiAEr1m_lxg][inet[/
192.168.1.107:9300]], added {[Hurricane][98CZj6QVSy2aiAEr1m_lxg][inet[/
192.168.1.107:9300]],[Cap 'N Hawk][W32DAJKFRV2zMPqw3vnuxA][inet[/
192.168.1.107:9302]],}, reason: zen-disco-receive(from master
[[Hurricane][98CZj6QVSy2aiAEr1m_lxg][inet[/192.168.1.107:9300]]])
[2011-11-02 13:13:46,609][DEBUG][gateway.local            ]
[Hurricane] elected state from [[Hurricane][98CZj6QVSy2aiAEr1m_lxg]
[inet[/192.168.1.107:9300]]]
[2011-11-02 13:13:46,611][INFO ][discovery                ] [Shamrock]
tltsteff_es/DRBdotspSPmv6HUHgpmuog
[2011-11-02 13:13:46,620][INFO ][http                     ] [Shamrock]
bound_address {inet[/0.0.0.0:9202]}, publish_address {inet[/
192.168.1.107:9202]}
[2011-11-02 13:13:46,620][INFO ][node                     ] [Shamrock]
{0.18.2}[1476]: started
[2011-11-02 13:13:46,627][DEBUG][gateway.local            ]
[Hurricane] [index1][0]: allocating [[index1][0], node[null], [P],
s[UNASSIGNED]] to [[Cap 'N Hawk][W32DAJKFRV2zMPqw3vnuxA][inet[/
192.168.1.107:9302]]] on primary allocation
[2011-11-02 13:13:46,629][DEBUG][gateway.local            ]
[Hurricane] [index1][1]: allocating [[index1][1], node[null], [P],
s[UNASSIGNED]] to [[Hurricane][98CZj6QVSy2aiAEr1m_lxg][inet[/
192.168.1.107:9300]]] on primary allocation
[2011-11-02 13:13:46,631][DEBUG][gateway.local            ]
[Hurricane] [index1][2]: allocating [[index1][2], node[null], [P],
s[UNASSIGNED]] to [[Hurricane][98CZj6QVSy2aiAEr1m_lxg][inet[/
192.168.1.107:9300]]] on primary allocation
[2011-11-02 13:13:46,633][DEBUG][gateway.local            ]
[Hurricane] [index2][0]: allocating [[index2][0], node[null], [P],
s[UNASSIGNED]] to [[Cap 'N Hawk][W32DAJKFRV2zMPqw3vnuxA][inet[/
192.168.1.107:9302]]] on primary allocation
[2011-11-02 13:13:46,635][DEBUG][gateway.local            ]
[Hurricane] [index2][1]: allocating [[index2][1], node[null], [P],
s[UNASSIGNED]] to [[Hurricane][98CZj6QVSy2aiAEr1m_lxg][inet[/
192.168.1.107:9300]]] on primary allocation
[2011-11-02 13:13:46,637][DEBUG][gateway.local            ]
[Hurricane] [index2][2]: allocating [[index2][2], node[null], [P],
s[UNASSIGNED]] to [[Hurricane][98CZj6QVSy2aiAEr1m_lxg][inet[/
192.168.1.107:9300]]] on primary allocation
[2011-11-02 13:13:46,639][DEBUG][gateway.local            ]
[Hurricane] [index3][0]: allocating [[index3][0], node[null], [P],
s[UNASSIGNED]] to [[Cap 'N Hawk][W32DAJKFRV2zMPqw3vnuxA][inet[/
192.168.1.107:9302]]] on primary allocation
[2011-11-02 13:13:46,641][DEBUG][gateway.local            ]
[Hurricane] [index3][1]: allocating [[index3][1], node[null], [P],
s[UNASSIGNED]] to [[Shamrock][DRBdotspSPmv6HUHgpmuog][inet[/
192.168.1.107:9301]]] on primary allocation
[2011-11-02 13:13:46,643][DEBUG][gateway.local            ]
[Hurricane] [index3][2]: allocating [[index3][2], node[null], [P],
s[UNASSIGNED]] to [[Cap 'N Hawk][W32DAJKFRV2zMPqw3vnuxA][inet[/
192.168.1.107:9302]]] on primary allocation
[2011-11-02 13:13:47,028][DEBUG][index.gateway            ]
[Hurricane] [index1][1] starting recovery from local ...

At the start of the log, it seems like the 3 new nodes
(Hurricane, Cap 'N Hawk and Shamrock) each take over one of the node
folders (/Applications/elasticsearch-0.18.2/data/tltsteff_es/nodes/0,
/Applications/elasticsearch-0.18.2/data/tltsteff_es/nodes/2 and
/Applications/elasticsearch-0.18.2/data/tltsteff_es/nodes/1
respectively) in the data folder. That's nice. But afterwards it seems
like Hurricane (the master, I guess) finds out that all 9 shards are
UNASSIGNED and starts assigning them to the running nodes (apparently
without taking into consideration which nodes already have the
shards). I do not understand why this happens. It is not so bad in this
case, where the nodes are all running on the same machine and where
there is no data in any of the indices :-) but in general it is stupid
to move shards around after a restart when "no nodes are missing".

What am I missing? Some explanation would be greatly appreciated.

Regards, Per Steffensen