Delayed unassigned shards

Hi guys!

I have a question regarding how delayed_unassigned_shards works.
Our ES cluster has

  • 5 data nodes, 5 master nodes, and 5 client nodes
  • we have this setting set to 5m for all our indices (one way to apply it is sketched after this list):
    "index.unassigned.node_left.delayed_timeout": "5m"
  • we use our cluster for log searching, so new indices are created every day.
  • all nodes have version 2.3.3
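
Since new indices are created daily, the timeout has to be picked up by every new index as it is created. One way to do that (just a sketch; the template name and pattern below are hypothetical, not our actual template) is to bake it into an index template:

  curl -XPUT 'localhost:9200/_template/delayed_timeout_example' -d '
  {
    "template": "*",
    "settings": {
      "index.unassigned.node_left.delayed_timeout": "5m"
    }
  }'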

We noticed that every time a data node is rebooted (e.g. during a deployment), even if it rejoins the cluster within 5 minutes, it still takes a long time (2-4 hours) for the unassigned shards (~3k) to be assigned, and they don't seem to be re-allocated back to the data node that was rebooted; instead the cluster goes through the whole rebalancing dance across all the nodes.

These are the logs from the master node when I restarted the data node.
We did see that allocation of the shards was delayed, but even after the node (team-02001.node-data) rejoined the cluster, the timer kept counting down and the shards were not assigned back to the node immediately.

[2016-11-08 18:46:52,445][INFO ][cluster.routing ] [team-02001.node-master] delaying allocation for [3532] unassigned shards, next check in [1m]
[2016-11-08 18:47:34,438][INFO ][cluster.service ] [team-02001.node-master] added {{team-02001.node-data}{fgcYfc4sR5aAUI1D9xjktw}{172.17.0.3}{172.17.0.3:9382}{ad=ad2, host=team-02001.node, max_local_storage_nodes=1, master=false},}, reason: zen-disco-join(join from node[{team-02001.node-data}{fgcYfc4sR5aAUI1D9xjktw}{172.17.0.3}{172.17.0.3:9382}{ad=ad2, host=team-02001.node, max_local_storage_nodes=1, master=false}])
[2016-11-08 18:47:54,132][INFO ][cluster.routing ] [team-02001.node-master] delaying allocation for [3447] unassigned shards, next check in [3.9m]

======== cluster health during the delay window (5 minutes) =========
[chihccha@lumberjack-02001 logs]$ curl localhost:9200/_cluster/health?pretty
{
"cluster_name" : "team.ad2.r2",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 30,
"number_of_data_nodes" : 5,
"active_primary_shards" : 8830,
"active_shards" : 14222,
"relocating_shards" : 0,
"initializing_shards" : 7,
"unassigned_shards" : 3431,
"delayed_unassigned_shards" : 3364,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 80.53227633069082
}

======== cluster health after the delay time (5 minutes) elapsed =========
[chihccha@lumberjack-02001 logs]$ curl localhost:9200/_cluster/health?pretty
{
"cluster_name" : "team.ad2.r2",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 30,
"number_of_data_nodes" : 5,
"active_primary_shards" : 8830,
"active_shards" : 14227,
"relocating_shards" : 0,
"initializing_shards" : 6,
"unassigned_shards" : 3427,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 1,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 80.56058890147226
}

Using the "_cat/recovery" command, we are seeing shards being moved between different nodes across the entire cluster (from looking at the source and target host for shards).

index shard time type stage source_host target_host repository snapshot files files_percent bytes bytes_percent total_files total_bytes translog translog_percent total_translog
teamA-2016-11-08 4 4263014 replica translog n/a n/a 286 100.0% 80163560685 100.0% 286 80163560685 2107461 20.8% 10124584
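
For reference, the sample row above came from a plain _cat/recovery call along these lines (the grep filter on one index name is just for illustration):

  curl 'localhost:9200/_cat/recovery?v' | grep 'teamA-2016-11-08'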

I wanted to ask: is this normal behavior?
From the documentation (https://www.elastic.co/guide/en/elasticsearch/reference/2.3/delayed-allocation.html), it sounds like the "index.unassigned.node_left.delayed_timeout" setting is supposed to help with exactly this shard-shuffling problem.
Please let us know if there is any additional information we can provide.

Thanks,
Chelsey

That's kinda excessive, you really only need 3 master-eligible nodes.

How many shards do you have?!

We have 5 hosts in total; each runs a master node, a data node, and a client node. (That's why it's 5 of everything.)

There are 6470 primary shards in total, each with 1 replica.
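
(In case it helps, totals like these can be double-checked with a quick pipeline along these lines; the awk/wc filtering is just a sketch, not necessarily how we counted:)

  # total shard copies (primaries + replicas)
  curl -s 'localhost:9200/_cat/shards' | wc -l
  # primaries only (the third column, prirep, is "p" for primaries)
  curl -s 'localhost:9200/_cat/shards' | awk '$3 == "p"' | wc -l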

That's waaaaaaay too many and would be why things are slow. You need to reduce this.

Thanks, Mark.
Regardless of how many shards we have, I am still trying to understand the behavior,
because I thought the delayed allocation (reflected in delayed_unassigned_shards) would help with recovery time, since shard shuffling across the cluster wouldn't happen.
However, we are still seeing shards being rebalanced across the cluster, and the recovery time is very long.

Is it because re-allocation takes some time for every shard, so before ES can re-allocate all ~3k "delayed unassigned shards" back to the original data node, the 5-minute window (delayed_timeout) has passed, resulting in the whole cluster starting the shard-shuffling dance anyway?
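
If that is what is happening, one experiment (assuming the setting can be updated on live indices, which the delayed-allocation docs linked above say it can) would be to raise the timeout well past our deployment window before the next node restart; the 30m value below is just an example, not something we have tried:

  curl -XPUT 'localhost:9200/_all/_settings' -d '
  {
    "settings": {
      "index.unassigned.node_left.delayed_timeout": "30m"
    }
  }'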

Thanks,
Chelsey

Does anyone have any ideas? Any help is appreciated!

Thanks!
