My cluster has 5 nodes: 1 master node and 4 data nodes, with 1 replica per shard.
Recently, the cluster often goes yellow because one of my data nodes spends too much time in garbage collection (GC).
When that node recovered, I found that nearly all of the replica shards had been allocated to it.
Does anyone know how to solve the problem?
Are all your nodes running the same version of Elasticsearch? You should be able to check this through the _cat/nodes API.
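A minimal Python sketch of that check. It assumes the cluster is reachable at http://localhost:9200 and that _cat/nodes accepts h=name,version to select just those two columns (verify the column names against the docs for your version). The helper names are hypothetical. nodes_by_version groups node names by the version they report, so a mixed-version cluster shows up as more than one key. The version is read from the last column because node names can contain spaces.

```python
import urllib.request
from collections import defaultdict


def fetch_cat_nodes(base_url="http://localhost:9200"):
    # Assumed host and column selection; adjust for your cluster.
    url = base_url + "/_cat/nodes?h=name,version"
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def nodes_by_version(cat_output):
    """Group node names by version from `_cat/nodes?h=name,version` text."""
    groups = defaultdict(list)
    for line in cat_output.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        # The version is the last column; node names may contain spaces.
        name, version = " ".join(parts[:-1]), parts[-1]
        groups[version].append(name)
    return dict(groups)
```

Calling nodes_by_version(fetch_cat_nodes()) on a uniform cluster should return a single key, for example {'1.5.2': [...]}.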
All of the nodes are running version 1.5.2.
You should really upgrade.
However, it sounds like your cluster is overloaded in general. Are you able to increase the node size or the node count?
It really doesn't matter that most of the replicas are on one node, as long as the total shard count per node is roughly even.
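To see whether the shard count per node is actually even, the _cat/shards output can be tallied per node. A minimal Python sketch, assuming http://localhost:9200 and the column selection h=index,shard,prirep,state,node. Only STARTED shards are counted, so unassigned, initializing and relocating rows are skipped. The helper names and the tolerance of 1 shard are illustrative choices, not anything Elasticsearch defines. Passing the full list of data node names to is_roughly_even matters, because a node holding zero shards never appears in the tally.

```python
import urllib.request
from collections import Counter


def fetch_cat_shards(base_url="http://localhost:9200"):
    # Assumed host and column selection; adjust for your cluster.
    url = base_url + "/_cat/shards?h=index,shard,prirep,state,node"
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def shard_counts(cat_output):
    """Return (total, primaries, replicas) Counters keyed by node name."""
    total, primaries, replicas = Counter(), Counter(), Counter()
    for line in cat_output.splitlines():
        parts = line.split()
        # Skip rows without a node or not yet STARTED (e.g. UNASSIGNED).
        if len(parts) < 5 or parts[3] != "STARTED":
            continue
        prirep, node = parts[2], " ".join(parts[4:])
        total[node] += 1
        if prirep == "p":
            primaries[node] += 1
        else:
            replicas[node] += 1
    return total, primaries, replicas


def is_roughly_even(total, nodes=None, tolerance=1):
    """True if busiest and idlest nodes differ by at most `tolerance` shards."""
    counts = [total.get(n, 0) for n in nodes] if nodes else list(total.values())
    if not counts:
        return True
    return max(counts) - min(counts) <= tolerance
```

With this, a node holding mostly replicas is fine as long as its entry in total is close to the others.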
Someone had changed the setting cluster.routing.allocation.balance.primary to a non-zero value. So when one of my nodes crashed, all of its shards were reallocated to the other nodes; when the crashed node came back, replica shards started to move back to it. Because of the cluster.routing.allocation.balance.primary setting, more than the average number of replica shards moved back to that node, and the node that had crashed went down again, over and over.
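If the setting was changed through the cluster settings API, one way to undo it is a PUT to _cluster/settings that sets it back to 0. A minimal Python sketch, assuming the setting is dynamically updatable on this version, that 0 was its value before someone changed it, and that the cluster is at http://localhost:9200. The helper names are hypothetical. Check GET _cluster/settings first and use the same scope (persistent or transient) the value appears under; if it was set in elasticsearch.yml instead, it has to be removed there and the node restarted.

```python
import json
import urllib.request

SETTING = "cluster.routing.allocation.balance.primary"


def reset_body(value=0.0, scope="persistent"):
    """Build a _cluster/settings body that sets the primary-balance weight."""
    if scope not in ("transient", "persistent"):
        raise ValueError("scope must be 'transient' or 'persistent'")
    return {scope: {SETTING: value}}


def apply_settings(body, base_url="http://localhost:9200"):
    # Sends the body with PUT; assumes no authentication in front of the cluster.
    req = urllib.request.Request(
        base_url + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

apply_settings(reset_body()) sends the request; reset_body on its own only builds the JSON body, which makes it easy to review the change before applying it.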
Upgrading Elasticsearch is a very good suggestion. Newer versions do fix a number of bugs, and I will upgrade soon.