Why 9 hour long shard reallocs when restarting one node in 2-node cluster?

Hi,

We've got about 8000 shards and 10GB of analytics data in our two-node cluster, with one replica. The nodes are in separate datacenters (I know, not recommended) with 120ms network latency.

If we reboot one of the nodes, when it comes up the cluster is in yellow state for 9 hours, reallocating shards.

I can understand shard reallocation in a 3-node cluster with one replica, but what does it need to reallocate in a two-node cluster? If each node should have all the data, surely it only needs to copy over the documents that were indexed while the other server was rebooting?

Would rolling restarts help at all here?

https://www.elastic.co/guide/en/elasticsearch/guide/current/_rolling_restarts.html
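(For reference, that procedure boils down to roughly the following on 1.x; 'localhost' is just a placeholder, and on 1.4 there is no synced flush step, so this alone may not avoid full shard copies.)

curl -XPUT 'localhost:9200/_cluster/settings' -d '{"transient": {"cluster.routing.allocation.enable": "none"}}'
# restart the node and wait for it to rejoin the cluster, then:
curl -XPUT 'localhost:9200/_cluster/settings' -d '{"transient": {"cluster.routing.allocation.enable": "all"}}'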

Which version of Elasticsearch are you on? Do you continuously insert/update/delete documents from the indices? Are both nodes in the cluster master eligible? If so what is your minimum_master_nodes set to?

That's way too many shards.

Hi, this is ES 1.4. It's probably getting 100 inserts a second, with some TTL deletions of 90-day-old documents. No updates. Both nodes are master eligible, and minimum_master_nodes is set to 1.

Elasticsearch does not transfer deltas, but rather full shards if the replicas are identified as being stale. In version 1.6 synced flush was introduced to reduce the risk of shards having to be transferred if they have not been updated. I would therefore recommend upgrading to ES 1.7.
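For what it's worth, once on 1.6 or later a synced flush can also be triggered manually right before a planned restart (this endpoint does not exist on 1.4; 'localhost' is a placeholder):

curl -XPOST 'localhost:9200/_flush/synced'

This writes a sync ID into each idle shard, so the restarted node's copies can be recognised as identical to the primaries without re-copying the data.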

Apart from that, there are a number of things about this setup that do not sound right:

  1. As Mark points out you have WAY too many shards for this size of cluster and data set. I would recommend rethinking your indexing and sharding practices to reduce this dramatically.

  2. You have deployed the cluster across two DCs with long latency between them, which, as you point out, is definitely not recommended.

  3. With 2 master-eligible nodes you need to set minimum_master_nodes to 2 in order to prevent split brain and data loss (see the example settings call after this list). Once you do this, Elasticsearch should stop accepting writes, updates and deletes until all nodes are back and the cluster is green again. This prevents shards from changing while a node is down and allows synced flush to reduce the number of shards/indices that will be considered stale once the node is brought back up. If you were to add a third master-eligible node, which can be a small dedicated master node, one side of the cluster would still be able to elect a master and continue accepting writes.

  4. Use of _ttl is also not recommended and has been deprecated in Elasticsearch 2.x.
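On point 3, the setting can be applied dynamically; a minimal sketch ('localhost' is a placeholder, and it should also go into elasticsearch.yml so it survives full cluster restarts):

curl -XPUT 'localhost:9200/_cluster/settings' -d '{"persistent": {"discovery.zen.minimum_master_nodes": 2}}'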

Thanks Christian! I will aim to implement your suggestions.

Hi again,

I tried the minimum_master_nodes approach, but it still takes a lot of time reassigning shards once the rebooted server comes back up.

In my two-node test cluster (fewer shards, but enough to take a minute to reassign when rebooting one of the nodes), here is what I set before rebooting one of them:

/_cluster/settings?pretty
{
  "persistent" : {
    "discovery" : {
      "zen" : {
        "minimum_master_nodes" : "2"
      }
    }
  },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "enable" : "none"
        }
      }
    }
  }
}

and I have the following health:

/_cluster/health?pretty
{
  "cluster_name" : "mycluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 160,
  "active_shards" : 320,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}

I then reboot one of the nodes, and until it has restarted I just get a 'master not discovered' error message when attempting to run the health command, which is what I expect.

Once the other node comes up, health shows:

/_cluster/health?pretty
{
  "cluster_name" : "mycluster",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 160,
  "active_shards" : 160,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 160
}

Note that the number of unassigned_shards is 160.

Then, after enabling shard allocation again with:
/_cluster/settings -d '{"transient" : {"cluster.routing.allocation.enable" : "all"}}'

the shards start getting assigned, slowly, two at a time.

This isn't what I want. When the rebooted node rejoins the cluster, I want the cluster to recognize that the rejoining node already has its shards, that nothing has changed, and that there's no need to spend all this time reallocating them. There should be no need to copy shards between servers.

Is this possible?

Can you try upgrading to a version that supports synced flush, e.g. 1.7.x, and see how it affects the situation?
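In the meantime, one way to see what recovery is actually doing (just a suggestion, not something verified against your cluster) is the cat recovery API, which shows per shard where the data is coming from and how many files/bytes are being copied, so you can tell whether the existing copies on the restarted node are being reused or re-transferred:

curl 'localhost:9200/_cat/recovery?v'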