Each concurrent batch of shards takes more than an hour to be allocated after a node leaves

Elasticsearch version:

"version" : {
  "number" : "1.4.1",
  "build_hash" : "89d3241d670db65f994242c8e8383b169779e2d4",
  "build_timestamp" : "2014-11-26T15:49:29Z",
  "build_snapshot" : false,
  "lucene_version" : "4.10.2"
}

JVM version:
> java version "1.8.0_40"
> Java(TM) SE Runtime Environment (build 1.8.0_40-b25)
> Java HotSpot(TM) 64-Bit Server VM (build 25.40-b25, mixed mode)

OS version:
> Linux (Ubuntu)

Description of the problem including expected versus actual behavior:

When a node leaves the cluster, a lot of shards get stuck in the INITIALIZING state.

Steps to reproduce:

1. Disable allocation
2. Stop Elasticsearch on the node
3. Re-enable allocation (example commands below)
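
For reference, steps 1 and 3 go through the cluster settings API, roughly like this (localhost:9200 is just a placeholder for one of our nodes):

```
# Step 1: disable shard allocation before stopping the node
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'

# Step 2: stop Elasticsearch on the node (service name may differ per install)
sudo service elasticsearch stop

# Step 3: re-enable allocation once the node is back (or should be rebalanced away)
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'
```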

Provide logs (if relevant):

Here is the output of _cat/pending_tasks | head:

> 01439777 56.5m URGENT shard-started ([error-newsflickss][0], node[xirsEXbZSpqldjylVrwzjw], [R], s[INITIALIZING]), reason [after recovery (replica) from node [[30GB_1TB_ComputeNodeNew13][4DtklsTUSd-eRhPG_c3uyw][ip-172-31-37-168][inet[/172.31.37.168:9300]]{master=false}]]

> 1439778 56.5m URGENT shard-started ([firstcrytest-notificationclickedmoe][0], node[xirsEXbZSpqldjylVrwzjw], [R], s[INITIALIZING]), reason [after recovery (replica) from node [[ES_r3_xlarge_1TB_Node_new15][sbwLxGR6RpmFHmHv3Vc6ag][ip-172-31-47-74][inet[/172.31.47.74:9300]]{master=false}]]

> 1439779 56.5m URGENT shard-started ([cleartrip-eamgc][0], node[Nm_bLbaGQWC0u0ofkIjWIw], [R], s[INITIALIZING]), reason [after recovery (replica) from node [[ES_r3_xlarge_1TB_Node_2][xirsEXbZSpqldjylVrwzjw][ip-172-31-41-73][inet[/172.31.41.73:9300]]{master=false}]]

> 1439783 56.5m URGENT shard-started ([emt-uat-fundtransfer][0], node[Nm_bLbaGQWC0u0ofkIjWIw], [R], s[INITIALIZING]), reason [after recovery (replica) from node [[30GB_1TB_ComputeNode11][jpNMJUlwQ7aQKk-hlcWLVQ][ip-172-31-40-207][inet[/172.31.40.207:9300]]{master=false}]]

> 1439822 56.4m URGENT shard-started ([firstcrytest-notificationclickedmoe][0], node[xirsEXbZSpqldjylVrwzjw], [R], s[INITIALIZING]), reason [master [ESMasterNode3][8E9mg0rZSHKITLFvvDDT2g][ip-172-31-46-130][inet[/172.31.46.130:9300]]{data=false, master=true} marked shard as initializing, but shard state is [POST_RECOVERY], mark shard as started]

> 1439780 56.5m URGENT shard-started ([chillr-requestshowqr][0], node[xirsEXbZSpqldjylVrwzjw], [R], s[INITIALIZING]), reason [after recovery (replica) from node [[30GB_1TB_ComputeNode11][jpNMJUlwQ7aQKk-hlcWLVQ][ip-172-31-40-207][inet[/172.31.40.207:9300]]{master=false}]]

> 1439799 56.4m URGENT shard-started ([sdsellerzone-catalogpdpback][0], node[Nm_bLbaGQWC0u0ofkIjWIw], [R], s[INITIALIZING]), reason [master [ESMasterNode3][8E9mg0rZSHKITLFvvDDT2g][ip-172-31-46-130][inet[/172.31.46.130:9300]]{data=false, master=true} marked shard as initializing, but shard state is [POST_RECOVERY], mark shard as started]

> 1439833 56.4m URGENT reroute_after_cluster_update_settings

> 1439824 56.4m URGENT shard-started ([cleartripprod-fbtan][0], node[xirsEXbZSpqldjylVrwzjw], [R], s[INITIALIZING]), reason [master [ESMasterNode3][8E9mg0rZSHKITLFvvDDT2g][ip-172-31-46-130][inet[/172.31.46.130:9300]]{data=false, master=true} marked shard as initializing, but shard state is [POST_RECOVERY], mark shard as started]

> 1439288 5.9h HIGH refresh-mapping [cleartripprod-apppl-2016-06-30][[datapoints]]

_cat/health

epoch timestamp cluster status node.total node.data shards pri relo init unassign
1468455843 00:24:03 DataPointsCluster yellow 22 19 26619 14017 0 20 1395
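
(For reference, the two cat outputs above were pulled with something like the commands below; localhost:9200 is just a placeholder for one of the nodes.)

```
curl -s 'localhost:9200/_cat/pending_tasks' | head
curl -s 'localhost:9200/_cat/health?v'
```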

_cluster/settings

{
  "persistent": {
    "cluster": {
      "routing": {
        "allocation": {
          "cluster_concurrent_rebalance": "10",
          "node_concurrent_recoveries": "14",
          "node_initial_primaries_recoveries": "4",
          "enable": "all"
        }
      }
    },
    "threadpool": {
      "bulk": {
        "keep_alive": "2m",
        "size": "16",
        "queue_size": "2000",
        "type": "fixed"
      }
    },
    "indices": {
      "recovery": {
        "concurrent_streams": "6",
        "max_bytes_per_sec": "120mb"
      }
    }
  },
  "transient": {
    "cluster": {
      "routing": {
        "allocation": {
          "node_initial_primaries_recoveries": "10",
          "balance": {
            "index": "0.80f"
          },
          "enable": "all",
          "allow_rebalance": "indices_all_active",
          "cluster_concurrent_rebalance": "0",
          "node_concurrent_recoveries": "5",
          "exclude": {
            "_ip": "172.31.39.58"
          }
        }
      }
    },
    "indices": {
      "recovery": {
        "concurrent_streams": "10"
      }
    }
  }
}
Is this because of too many shards?

Here is the stack trace for the master's hottest thread:

100.1% (500.5ms out of 500ms) cpu usage by thread 'elasticsearch[ESMasterNode3][clusterService#updateTask][T#1]'
     7/10 snapshots sharing following 28 elements
       org.elasticsearch.common.collect.UnmodifiableListIterator.<init>(UnmodifiableListIterator.java:34)
       org.elasticsearch.common.collect.AbstractIndexedListIterator.<init>(AbstractIndexedListIterator.java:68)
       org.elasticsearch.common.collect.Iterators$11.<init>(Iterators.java:1058)
       org.elasticsearch.common.collect.Iterators.forArray(Iterators.java:1058)
       org.elasticsearch.common.collect.RegularImmutableList.listIterator(RegularImmutableList.java:106)
       org.elasticsearch.common.collect.ImmutableList.listIterator(ImmutableList.java:344)
       org.elasticsearch.common.collect.ImmutableList.iterator(ImmutableList.java:340)
       org.elasticsearch.cluster.routing.IndexShardRoutingTable.iterator(IndexShardRoutingTable.java:172)
       org.elasticsearch.cluster.routing.IndexShardRoutingTable.iterator(IndexShardRoutingTable.java:45)
       org.elasticsearch.cluster.routing.IndexShardRoutingTable.shardsWithState(IndexShardRoutingTable.java:509)
       org.elasticsearch.cluster.routing.IndexRoutingTable.shardsWithState(IndexRoutingTable.java:268)
       org.elasticsearch.cluster.routing.RoutingTable.shardsWithState(RoutingTable.java:114)
       org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDecider.sizeOfRelocatingShards(DiskThresholdDecider.java:225)
       org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDecider.canAllocate(DiskThresholdDecider.java:288)
       org.elasticsearch.cluster.routing.allocation.decider.AllocationDeciders.canAllocate(AllocationDeciders.java:74)
       org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator$Balancer.tryRelocateShard(BalancedShardsAllocator.java:799)
       org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator$Balancer.balance(BalancedShardsAllocator.java:426)
       org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator.rebalance(BalancedShardsAllocator.java:124)
       org.elasticsearch.cluster.routing.allocation.allocator.ShardsAllocators.rebalance(ShardsAllocators.java:81)
       org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:226)
       org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:160)
       org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:146)
       org.elasticsearch.action.admin.cluster.settings.TransportClusterUpdateSettingsAction$1$1.execute(TransportClusterUpdateSettingsAction.java:169)
       org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:329)
       org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:153)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
       java.lang.Thread.run(Thread.java:745)
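
(The trace above comes from the nodes hot threads API, captured with something along these lines; the host and node name are placeholders.)

```
curl -s 'localhost:9200/_nodes/ESMasterNode3/hot_threads'
```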

Please format your code.

It will take time to rebalance all the shards on the cluster.

How many shards and nodes do you have?

You should upgrade to 1.7, as many improvements have been made to recovery since 1.4.


There are 22 nodes: 19 data nodes and 3 master-only nodes.

Here is the output of _cat/health:

epoch timestamp cluster status node.total node.data shards pri relo init unassign 
1468455843 00:24:03 DataPointsCluster yellow 22 19 26619 14017 0 20 1395

So, roughly 1,400 shards per node?
That's a lot IMO.
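
You can check the per-node spread with the allocation cat API, something like:

```
curl -s 'localhost:9200/_cat/allocation?v'
```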

You should reduce the pressure by either reducing the number of shards or increasing the number of nodes or both.

And upgrade!

The shards are not very large; the maximum shard size is around 10 GB. Any cluster-level operation takes around 1.5 hours, and I cannot even delete indices. So the only option left is to upgrade...

But when a node is stopped, you have to rebalance a LOT of shards, which takes a lot of time IMO.
If your shards are small, why have so many? Are you using the default index settings of 5 shards and 1 replica?

Maybe you should change that to reduce the pressure on each node (a rough example is below)?
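
For new indices that could be done with an index template, along these lines (the template name and pattern are only examples, and existing indices would still need to be reindexed to fewer shards):

```
curl -XPUT 'localhost:9200/_template/fewer_shards' -d '{
  "template": "*",
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  }
}'
```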

Not an exact comparison, but having 1,000 shards is like having 1,000 MySQL instances running on the same physical machine. Again, not exact, but you get the idea. A rule of thumb is to keep the number of shards per machine close to the number of cores (or a small multiple of it)...

But maybe others have different thoughts.