Ability to stop and start a cluster without shard movement, part 2

We had to park this issue as other things came up; restarting this thread now.

1. We tested the ES cluster stop and start again, with an equal number of shards (204) on all 7 nodes.

[root@elasticsearch2 share]# curl 'elasticsearch7:9200/_cat/allocation?v'
shards disk.indices disk.used disk.avail disk.total disk.percent host ip node
204 56.9gb 64.5gb 83gb 147.5gb 43 10.0.0.17 10.0.0.17 wht1qYx
204 57.3gb 65.1gb 82.3gb 147.5gb 44 10.0.0.9 10.0.0.9 70eRJrE
204 57.5gb 65.1gb 82.3gb 147.5gb 44 10.0.0.12 10.0.0.12 zlHb2gq
204 57.8gb 65.4gb 82gb 147.5gb 44 10.0.0.15 10.0.0.15 idNEX8h
204 57.4gb 65.3gb 82.2gb 147.5gb 44 10.0.0.20 10.0.0.20 7sNEZsb
204 57.3gb 64.9gb 82.5gb 147.5gb 44 10.0.0.14 10.0.0.14 lURUCVH
204 57.4gb 65gb 82.4gb 147.5gb 44 10.0.0.8 10.0.0.8 Y0EJjkY

2. Below is the output after disabling the allocation. The number of shards is no longer even across the 7 ES nodes, and we can see 714 shards unassigned.

[root@elasticsearch7 share]# curl 'elasticsearch7:9200/_cat/allocation?v'
shards disk.indices disk.used disk.avail disk.total disk.percent host ip node
0 0b 65.5gb 81.9gb 147.5gb 44 10.0.0.17 10.0.0.17 7sNEZsb
155 49.5gb 64.5gb 83gb 147.5gb 43 10.0.0.9 10.0.0.9 wht1qYx
150 49.3gb 65.4gb 82gb 147.5gb 44 10.0.0.12 10.0.0.12 idNEX8h
62 5.4gb 65.1gb 82.3gb 147.5gb 44 10.0.0.15 10.0.0.15 zlHb2gq
155 54gb 64.9gb 82.5gb 147.5gb 44 10.0.0.13 10.0.0.13 lURUCVH
86 9.6gb 65.1gb 82.4gb 147.5gb 44 10.0.0.14 10.0.0.14 Y0EJjkY
106 32.8gb 65gb 82.4gb 147.5gb 44 10.0.0.8 10.0.0.8 70eRJrE
714 UNASSIGNED

[root@elasticsearch7 share]# curl -XGET 'elasticsearch2:9200/_cluster/health?pretty'
{
  "cluster_name" : "elastic-search-cluster",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 7,
  "number_of_data_nodes" : 7,
  "active_primary_shards" : 714,
  "active_shards" : 714,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 714,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 50.0
}

3. Below is the output after re-enabling the allocation once the nodes were restarted. We can see shards relocating to balance the shard counts across the cluster nodes.

[root@elasticsearch7 share]# curl -XGET 'elasticsearch2:9200/_cluster/health?pretty'
{
  "cluster_name" : "elastic-search-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 7,
  "number_of_data_nodes" : 7,
  "active_primary_shards" : 714,
  "active_shards" : 1428,
  "relocating_shards" : 2,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

This is related to the original thread, Ability to stop and start a cluster without shard movement; the logs are attached on that thread.

Thanks, the logs are helpful.

You set cluster.routing.allocation.enable: "none" here:

[2018-06-04T12:13:56,589][INFO ][o.e.c.s.ClusterSettings ] [70eRJrE] updating [cluster.routing.allocation.enable] from [ALL] to [none]
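
For reference, that change corresponds to a request along these lines (a sketch only, using one of the hostnames from your curl commands; the logs don't show the exact request or whether the transient or persistent scope was used):

# disable shard allocation cluster-wide (sketch; actual request not visible in the logs)
curl -XPUT 'elasticsearch2:9200/_cluster/settings' -H 'Content-Type: application/json' \
  -d '{"transient":{"cluster.routing.allocation.enable":"none"}}'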

The logs indicate that all 1428 shards are allocated at this time:

Number of STARTED/RELOCATING shards by node at 2018-06-04T12:13:56,644
70eRJrE_T7SMoPLC11i5Kg 204/0
7sNEZsbNSU-CxzG_64n1CQ 204/0
Y0EJjkYjS3aRj5YDJvStMQ 204/0
idNEX8heSdK99ZQYgSvFHQ 204/0
lURUCVHCQUCfZMdngLrlmA 204/0
wht1qYxoRuKMDrrbNQXFnA 204/0
zlHb2gqcSjqaO2Abl4bpZQ 204/0
total                  1428/0

The next allocation that takes place is this one:

Number of STARTED/RELOCATING shards by node at 2018-06-04T12:46:08,771
70eRJrE_T7SMoPLC11i5Kg 204/0
7sNEZsbNSU-CxzG_64n1CQ 0/0
Y0EJjkYjS3aRj5YDJvStMQ 204/0
idNEX8heSdK99ZQYgSvFHQ 204/0
lURUCVHCQUCfZMdngLrlmA 204/0
wht1qYxoRuKMDrrbNQXFnA 204/0
zlHb2gqcSjqaO2Abl4bpZQ 204/0
total                  1224/0

There are fewer shards allocated because node 7sNEZsbNSU-CxzG_64n1CQ seems to have shut down:

[2018-06-04T12:46:08,779][INFO ][o.e.c.r.a.AllocationService] [70eRJrE] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{7sNEZsb}{7sNEZsbNSU-CxzG_64n1CQ}{7dfOZyo6SiKnUjkcFIvWZw}{10.0.0.20}{10.0.0.20:9300} transport disconnected]).

Two more nodes subsequently shut down:

[2018-06-04T12:46:57,224][INFO ][o.e.c.s.ClusterService   ] [70eRJrE] removed {{zlHb2gq}{zlHb2gqcSjqaO2Abl4bpZQ}{wBqonQhvRvaHiOZaAm0Ayw}{10.0.0.12}{10.0.0.12:9300},}, reason: zen-disco-node-failed({zlHb2gq}{zlHb2gqcSjqaO2Abl4bpZQ}{wBqonQhvRvaHiOZaAm0Ayw}{10.0.0.12}{10.0.0.12:9300}), reason(transport disconnected)[{zlHb2gq}{zlHb2gqcSjqaO2Abl4bpZQ}{wBqonQhvRvaHiOZaAm0Ayw}{10.0.0.12}{10.0.0.12:9300} transport disconnected]
[2018-06-04T12:47:27,958][INFO ][o.e.c.s.ClusterService   ] [70eRJrE] removed {{Y0EJjkY}{Y0EJjkYjS3aRj5YDJvStMQ}{Mkjl7USDTOKo1vHQYrTSrw}{10.0.0.8}{10.0.0.8:9300},}, reason: zen-disco-node-failed({Y0EJjkY}{Y0EJjkYjS3aRj5YDJvStMQ}{Mkjl7USDTOKo1vHQYrTSrw}{10.0.0.8}{10.0.0.8:9300}), reason(transport disconnected)[{Y0EJjkY}{Y0EJjkYjS3aRj5YDJvStMQ}{Mkjl7USDTOKo1vHQYrTSrw}{10.0.0.8}{10.0.0.8:9300} transport disconnected]

At this point the cluster has lost its quorum of master nodes:

[2018-06-04T12:48:18,712][WARN ][o.e.d.z.ZenDiscovery     ] [70eRJrE] not enough master nodes, current nodes: nodes:
   {lURUCVH}{lURUCVHCQUCfZMdngLrlmA}{D2cyzOiuSw2ima58-hT_1g}{10.0.0.14}{10.0.0.14:9300}
   {wht1qYx}{wht1qYxoRuKMDrrbNQXFnA}{SQyNcSZETIaHYcYxmR4CBA}{10.0.0.17}{10.0.0.17:9300}
   {idNEX8h}{idNEX8heSdK99ZQYgSvFHQ}{qqxrx8U0SoiVtZOFD06cxA}{10.0.0.15}{10.0.0.15:9300}
   {70eRJrE}{70eRJrE_T7SMoPLC11i5Kg}{hiYGqZpEQBG8pY2mNFgbjw}{10.0.0.9}{10.0.0.9:9300}, local, master

It makes a lot of noise about DNS failures like this:

[2018-06-04T12:48:24,793][WARN ][o.e.d.z.UnicastZenPing   ] [70eRJrE] failed to resolve host [elasticsearch5]
java.net.UnknownHostException: elasticsearch5

The node is restarted:

[2018-06-04T12:51:59,272][INFO ][o.e.n.Node               ] [] initializing ...

(I do not see any messages indicating it was cleanly shut down, which is of some concern)

When some other nodes are also restarted it is elected master again (and still has cluster.routing.allocation.enable: "none"):

[2018-06-04T12:53:29,397][INFO ][o.e.c.s.ClusterService   ] [70eRJrE] new_master {70eRJrE}{70eRJrE_T7SMoPLC11i5Kg}{f2nNJNnGSE6LfWS9-vweEA}{10.0.0.8}{10.0.0.8:9300}, added {{lURUCVH}{lURUCVHCQUCfZMdngLrlmA}{KnIJ-GKBRY-OZ605RZj8uQ}{10.0.0.13}{10.0.0.13:9300},{idNEX8h}{idNEX8heSdK99ZQYgSvFHQ}{DnHTNF4TScKNM4nfEZuDzw}{10.0.0.12}{10.0.0.12:9300},{wht1qYx}{wht1qYxoRuKMDrrbNQXFnA}{TV-HF6hkRuKoc9yqNHvVtg}{10.0.0.9}{10.0.0.9:9300},}, reason: zen-disco-elected-as-master ([3] nodes joined)[{wht1qYx}{wht1qYxoRuKMDrrbNQXFnA}{TV-HF6hkRuKoc9yqNHvVtg}{10.0.0.9}{10.0.0.9:9300}, {lURUCVH}{lURUCVHCQUCfZMdngLrlmA}{KnIJ-GKBRY-OZ605RZj8uQ}{10.0.0.13}{10.0.0.13:9300}, {idNEX8h}{idNEX8heSdK99ZQYgSvFHQ}{DnHTNF4TScKNM4nfEZuDzw}{10.0.0.12}{10.0.0.12:9300}]
[2018-06-04T12:53:32,617][INFO ][o.e.c.s.ClusterSettings  ] [70eRJrE] updating [cluster.routing.allocation.enable] from [ALL] to [none]

The primary shards progressively start again, eventually all succeeding:

Number of STARTED/RELOCATING shards by node at 2018-06-04T12:54:54,339
70eRJrE_T7SMoPLC11i5Kg 106/0
7sNEZsbNSU-CxzG_64n1CQ 0/0
Y0EJjkYjS3aRj5YDJvStMQ 86/0
idNEX8heSdK99ZQYgSvFHQ 150/0
lURUCVHCQUCfZMdngLrlmA 155/0
wht1qYxoRuKMDrrbNQXFnA 155/0
zlHb2gqcSjqaO2Abl4bpZQ 62/0
total                  714/0

(This is what you describe in the OP as "after disabling the allocation" - in fact you've also restarted the whole cluster, which makes quite a big difference.)

At this point you set cluster.routing.allocation.enable: null to re-enable allocation:

[2018-06-04T13:26:05,223][INFO ][o.e.c.s.ClusterSettings  ] [70eRJrE] updating [cluster.routing.allocation.enable] from [none] to [all]
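
That corresponds to something like the following (again a sketch; setting the value to null removes the override so the default of all applies):

# re-enable shard allocation by removing the override (sketch)
curl -XPUT 'elasticsearch2:9200/_cluster/settings' -H 'Content-Type: application/json' \
  -d '{"transient":{"cluster.routing.allocation.enable":null}}'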

The cluster completely recovers all 1428 shards without any relocations:

Number of STARTED/RELOCATING shards by node at 2018-06-04T13:33:31,975
70eRJrE_T7SMoPLC11i5Kg 203/0
7sNEZsbNSU-CxzG_64n1CQ 206/0
Y0EJjkYjS3aRj5YDJvStMQ 204/0
idNEX8heSdK99ZQYgSvFHQ 202/0
lURUCVHCQUCfZMdngLrlmA 204/0
wht1qYxoRuKMDrrbNQXFnA 203/0
zlHb2gqcSjqaO2Abl4bpZQ 206/0
total                  1428/0

To be continued...

... continued:

However, note that the shard counts are now imbalanced, so some of the shards are shuffled around to balance things out:

Number of STARTED/RELOCATING shards by node at 2018-06-04T13:37:00,231
70eRJrE_T7SMoPLC11i5Kg 204/0
7sNEZsbNSU-CxzG_64n1CQ 204/0
Y0EJjkYjS3aRj5YDJvStMQ 204/0
idNEX8heSdK99ZQYgSvFHQ 204/0
lURUCVHCQUCfZMdngLrlmA 204/0
wht1qYxoRuKMDrrbNQXFnA 204/0
zlHb2gqcSjqaO2Abl4bpZQ 204/0
total                  1428/0

This obviously raises the question of why the shard counts are imbalanced after recovery, requiring rebalancing. One reason for this is that data was still being written to the cluster after nodes started to shut down, which led to some of the replicas being marked as stale:

[2018-06-04T08:51:50,959][WARN ][o.e.c.a.s.ShardStateAction] [70eRJrE] [.monitoring-data-2][0] received shard failed for shard id [[.monitoring-data-2][0]], allocation id [_eFnin88SQ6BUXY1l1MtJg], primary term [6], message [mark copy as stale]
[2018-06-04T12:46:22,457][WARN ][o.e.c.a.s.ShardStateAction] [70eRJrE] [.monitoring-es-2-2018.06.04][0] received shard failed for shard id [[.monitoring-es-2-2018.06.04][0]], allocation id [UUBMAwzRRwCEJQ7ENQ5qhg], primary term [2], message [mark copy as stale]
[2018-06-04T12:46:22,466][WARN ][o.e.c.a.s.ShardStateAction] [70eRJrE] [.monitoring-es-2-2018.06.04][0] received shard failed for shard id [[.monitoring-es-2-2018.06.04][0]], allocation id [UUBMAwzRRwCEJQ7ENQ5qhg], primary term [2], message [mark copy as stale]
[2018-06-04T12:46:22,469][WARN ][o.e.c.a.s.ShardStateAction] [70eRJrE] [.monitoring-es-2-2018.06.04][0] received shard failed for shard id [[.monitoring-es-2-2018.06.04][0]], allocation id [UUBMAwzRRwCEJQ7ENQ5qhg], primary term [2], message [mark copy as stale]
[2018-06-04T12:46:22,474][WARN ][o.e.c.a.s.ShardStateAction] [70eRJrE] [.monitoring-es-2-2018.06.04][0] received shard failed for shard id [[.monitoring-es-2-2018.06.04][0]], allocation id [UUBMAwzRRwCEJQ7ENQ5qhg], primary term [2], message [mark copy as stale]
[2018-06-04T12:46:22,490][WARN ][o.e.c.a.s.ShardStateAction] [70eRJrE] [.monitoring-es-2-2018.06.04][0] received shard failed for shard id [[.monitoring-es-2-2018.06.04][0]], allocation id [UUBMAwzRRwCEJQ7ENQ5qhg], primary term [2], message [mark copy as stale]
[2018-06-04T12:47:29,497][WARN ][o.e.c.a.s.ShardStateAction] [70eRJrE] [.monitoring-data-2][0] received shard failed for shard id [[.monitoring-data-2][0]], allocation id [zah4oRKkQoqxOEFouSut7g], primary term [6], message [mark copy as stale]

Since these replicas were stale we could not use their existing on-disk copies, causing them to be allocated elsewhere during the recovery process.

To quote my earlier message:

It's important to follow all of the steps in the manual, including the fully-successful synced-flush and the wait-for-green at the end, and also important not to perform any indexing while the cluster is not green as this may cause replicas to require recovery.
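
Concretely, those two checks look something like this (a sketch using the hostnames from this thread; the full procedure in the manual is the authoritative reference):

# perform a synced flush and confirm every index reports "failed" : 0
curl -XPOST 'elasticsearch2:9200/_flush/synced?pretty'

# after re-enabling allocation, wait for green before resuming indexing
curl -XGET 'elasticsearch2:9200/_cluster/health?wait_for_status=green&timeout=60s&pretty'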

However, this doesn't explain everything. In total I count 206 shards that were allocated differently after recovery. Let's look at the first one, [autozonebinaryobjectindex_v3][1]. Initially it's allocated with its primary on 7sNEZsbNSU-CxzG_64n1CQ and its replica on Y0EJjkYjS3aRj5YDJvStMQ.

[2018-06-04T12:12:38,688][TRACE][o.e.c.r.a.a.BalancedShardsAllocator] [70eRJrE] Assigned shard [[autozonebinaryobjectindex_v3][1], node[Y0EJjkYjS3aRj5YDJvStMQ], [R], s[STARTED], a[id=bB04wTvNQmOMMsP95cO2Ag]] to node [Y0EJjkYjS3aRj5YDJvStMQ]
[2018-06-04T12:12:38,725][TRACE][o.e.c.r.a.a.BalancedShardsAllocator] [70eRJrE] Assigned shard [[autozonebinaryobjectindex_v3][1], node[7sNEZsbNSU-CxzG_64n1CQ], [P], s[STARTED], a[id=gWiRn00mTYWkh8HJO1s1hg]] to node [7sNEZsbNSU-CxzG_64n1CQ]

After restart the primary is allocated to Y0EJjkYjS3aRj5YDJvStMQ where the replica was previously allocated:

[2018-06-04T12:54:21,666][TRACE][o.e.c.r.a.a.BalancedShardsAllocator] [70eRJrE] Assigned shard [[autozonebinaryobjectindex_v3][1], node[Y0EJjkYjS3aRj5YDJvStMQ], [P], s[STARTED], a[id=bB04wTvNQmOMMsP95cO2Ag]] to node [Y0EJjkYjS3aRj5YDJvStMQ]

When allocation is re-enabled, for some reason the replica is allocated to lURUCVHCQUCfZMdngLrlmA and not to 7sNEZsbNSU-CxzG_64n1CQ:

[2018-06-04T13:27:58,029][TRACE][o.e.c.r.a.a.BalancedShardsAllocator] [70eRJrE] Assigned shard [[autozonebinaryobjectindex_v3][1], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2018-06-04T12:53:30.740Z], delayed=false, allocation_status[no_attempt]]] to [lURUCVHCQUCfZMdngLrlmA]

This is a little unexpected and I need to spend some time investigating this further.

The first changed assignment seems to be the following:

[2018-06-04T12:12:38,682][TRACE][o.e.c.r.a.a.BalancedShardsAllocator] [70eRJrE] Assigned shard [[dataplatformeventindex_v3][1], node[zlHb2gqcSjqaO2Abl4bpZQ], [R], s[STARTED], a[id=M5sALO7qSqicSNGtQV42kg]] to node [zlHb2gqcSjqaO2Abl4bpZQ]
[2018-06-04T12:12:38,704][TRACE][o.e.c.r.a.a.BalancedShardsAllocator] [70eRJrE] Assigned shard [[dataplatformeventindex_v3][1], node[wht1qYxoRuKMDrrbNQXFnA], [P], s[STARTED], a[id=FPNHFXoVRjydwYoJlr4Eyg]] to node [wht1qYxoRuKMDrrbNQXFnA]
...
[2018-06-04T13:26:31,410][TRACE][o.e.c.r.a.a.BalancedShardsAllocator] [70eRJrE] Assigned shard [[dataplatformeventindex_v3][1], node[zlHb2gqcSjqaO2Abl4bpZQ], [P], s[STARTED], a[id=M5sALO7qSqicSNGtQV42kg]] to node [zlHb2gqcSjqaO2Abl4bpZQ]
[2018-06-04T13:26:31,417][TRACE][o.e.c.r.a.a.BalancedShardsAllocator] [70eRJrE] Assigned shard [[dataplatformeventindex_v3][1], node[70eRJrE_T7SMoPLC11i5Kg], [R], s[STARTED], a[id=K6JNwxcnQ82HD2KvA5vIag]] to node [70eRJrE_T7SMoPLC11i5Kg]

This is not obviously related to the stale shards, but the most likely explanation is simply that the cluster was not properly sync-flushed before shutdown. We could tell with TRACE logging as follows:

PUT _cluster/settings
{"persistent":{"logger.org.elasticsearch.gateway.ReplicaShardAllocator":"TRACE"}}

Thanks David for the detailed analysis.

A bit of context should also help.
These nodes are Docker containers, and we do not have any control over how or when they lose the network. When a container comes back, it uses its container name (e.g. elastic1) to find the elastic1 disk and attach it to the node; the same applies to the other containers in the list, namely elastic1 to elastic7.

I shall provide more details on what happened here with the network loss and the container restarts. If it is network loss, how do I increase the timeout so that the cluster tolerates it for longer? It does take some time before the networking layer catches up with all the containers.
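
For example, I am assuming the delayed allocation timeout below is the kind of setting involved (a sketch only; please confirm whether this is the right knob for tolerating brief node loss without shard movement):

# give a departed node more time to come back before its replicas are reallocated
curl -XPUT 'elasticsearch2:9200/_all/_settings' -H 'Content-Type: application/json' \
  -d '{"settings":{"index.unassigned.node_left.delayed_timeout":"10m"}}'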

As before, there is no clean shutdown; we just stop and start the container, which kills the process and then starts the container again. Let me know if you are expecting something else.

This is the purpose of this thread. I have three requirements. The first is a weekend shutdown and start of the entire cluster for cost saving. The second comes from Microsoft Azure: stopping a node, increasing the disk, and then starting it again without shard movement (in AWS you can increase the disk without restarting). The third is taking disk backups from one environment, such as Prod, and restoring them on a Pre-Prod environment without shard movement, rather than the backup route, which takes time depending on the cluster disk size.

I shall redo this test and capture the sync flush details to check whether this too can be proven. I am assuming sync flush is like asking Elastic which of your shards have stale copies and fixing them. I would still recommend splitting this into two: a status API to identify which shards are stale, and another API to fix the stale shards. That way we can check the status, so there is no ambiguity before we bring down the cluster.
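
As a partial substitute for that status check, I assume something like the allocation explain API could be used (a sketch; it explains one unassigned shard at a time rather than listing every stale copy):

# explain why an unassigned shard is unassigned, including whether existing
# on-disk copies were found and whether they are stale
# (returns an error if there are currently no unassigned shards)
curl -XGET 'elasticsearch2:9200/_cluster/allocation/explain?pretty'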

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.