Delayed unassigned shards

Hi guys!

I have a question regarding how delayed_unassigned_shards works.
Our ES cluster has

  • 5 data nodes, 5 master nodes, and 5 client nodes
  • we have this setting set to 5m for all our indices (one way to apply it is sketched after this list):
    "index.unassigned.node_left.delayed_timeout": "5m"
  • we use our cluster for log searching, so new indices are created every day.
  • all nodes have version 2.3.3
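
Since new indices are created daily, the timeout has to be picked up by every new index as it is created. One way to do that (just a sketch; the template name and pattern below are hypothetical, not our actual template) is to bake it into an index template:

  curl -XPUT 'localhost:9200/_template/delayed_timeout_example' -d '
  {
    "template": "*",
    "settings": {
      "index.unassigned.node_left.delayed_timeout": "5m"
    }
  }'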

We noticed that every time a data node is rebooted (e.g. during a deployment), even if it rejoins the cluster within 5 minutes, it still takes a long time (2-4 hours) for the unassigned shards (~3k) to be assigned, and they don't seem to be re-allocated back to the data node that was rebooted; instead the cluster goes through the whole rebalancing dance across all the nodes.

These are the logs from the master node when I restarted the data node.
We did see that allocation of the shards was delayed, but even after the node (team-02001.node-data) rejoined the cluster, the timer kept counting down and the shards were not assigned back to the node immediately.

[2016-11-08 18:46:52,445][INFO ][cluster.routing ] [team-02001.node-master] delaying allocation for [3532] unassigned shards, next check in [1m]
[2016-11-08 18:47:34,438][INFO ][cluster.service ] [team-02001.node-master] added {{team-02001.node-data}{fgcYfc4sR5aAUI1D9xjktw}{172.17.0.3}{172.17.0.3:9382}{ad=ad2, host=team-02001.node, max_local_storage_nodes=1, master=false},}, reason: zen-disco-join(join from node[{team-02001.node-data}{fgcYfc4sR5aAUI1D9xjktw}{172.17.0.3}{172.17.0.3:9382}{ad=ad2, host=team-02001.node, max_local_storage_nodes=1, master=false}])
[2016-11-08 18:47:54,132][INFO ][cluster.routing ] [team-02001.node-master] delaying allocation for [3447] unassigned shards, next check in [3.9m]

======== cluster health during the delay window (5 minutes) =========
[chihccha@lumberjack-02001 logs]$ curl localhost:9200/_cluster/health?pretty
{
"cluster_name" : "team.ad2.r2",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 30,
"number_of_data_nodes" : 5,
"active_primary_shards" : 8830,
"active_shards" : 14222,
"relocating_shards" : 0,
"initializing_shards" : 7,
"unassigned_shards" : 3431,
"delayed_unassigned_shards" : 3364,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 80.53227633069082
}

======== cluster health after the delay time (5 minutes) elapsed =========
[chihccha@lumberjack-02001 logs]$ curl localhost:9200/_cluster/health?pretty
{
"cluster_name" : "team.ad2.r2",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 30,
"number_of_data_nodes" : 5,
"active_primary_shards" : 8830,
"active_shards" : 14227,
"relocating_shards" : 0,
"initializing_shards" : 6,
"unassigned_shards" : 3427,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 1,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 80.56058890147226
}

Using the "_cat/recovery" command, we are seeing shards being moved between different nodes across the entire cluster (from looking at the source and target host for shards).

index shard time type stage source_host target_host repository snapshot files files_percent bytes bytes_percent total_files total_bytes translog translog_percent total_translog
teamA-2016-11-08 4 4263014 replica translog n/a n/a 286 100.0% 80163560685 100.0% 286 80163560685 2107461 20.8% 10124584
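
For reference, the sample row above came from a plain _cat/recovery call along these lines (the grep filter on one index name is just for illustration):

  curl 'localhost:9200/_cat/recovery?v' | grep 'teamA-2016-11-08'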

I wanted to ask: is this normal behavior?
From the documentation (https://www.elastic.co/guide/en/elasticsearch/reference/2.3/delayed-allocation.html), it sounds like the "index.unassigned.node_left.delayed_timeout" setting is supposed to help with exactly this shard-shuffling problem.
Please let us know if there is any additional information we can provide.

Thanks,
Chelsey

That's kinda excessive, you really only need 3 master-eligible nodes.

How many shards do you have?!

We have 5 hosts in total; each runs a master node, a data node, and a client node. (That's why it's 5 of everything.)

There are 6470 primary shards in total, each with 1 replica.
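
(In case it helps, totals like these can be double-checked with a quick pipeline along these lines; the awk/wc filtering is just a sketch, not necessarily how we counted:)

  # total shard copies (primaries + replicas)
  curl -s 'localhost:9200/_cat/shards' | wc -l
  # primaries only (the third column, prirep, is "p" for primaries)
  curl -s 'localhost:9200/_cat/shards' | awk '$3 == "p"' | wc -l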

That's waaaaaaay too many and would be why things are slow. You need to reduce this.

Thanks, Mark.
Regardless of how many shards we have, I am still trying to understand the behavior,
because I thought the delayed allocation (reflected in delayed_unassigned_shards) would help with recovery time, since shard shuffling across the cluster wouldn't happen.
However, we are still seeing shards being rebalanced across the cluster, and the recovery time is very long.

Is it because re-allocation takes some time for every shard, so before ES can re-allocate all ~3k "delayed unassigned shards" back to the original data node, the 5-minute window (delayed_timeout) has passed, resulting in the whole cluster starting the shard-shuffling dance anyway?
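
If that is what is happening, one experiment (assuming the setting can be updated on live indices, which the delayed-allocation docs linked above say it can) would be to raise the timeout well past our deployment window before the next node restart; the 30m value below is just an example, not something we have tried:

  curl -XPUT 'localhost:9200/_all/_settings' -d '
  {
    "settings": {
      "index.unassigned.node_left.delayed_timeout": "30m"
    }
  }'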

Thanks,
Chelsey

Does anyone have any ideas? Any help is appreciated!

Thanks!
