Elasticsearch version:
> "version" : { "number" : "1.4.1", "build_hash" : "89d3241d670db65f994242c8e8383b169779e2d4", "build_timestamp" : "2014-11-26T15:49:29Z", "build_snapshot" : false, "lucene_version" : "4.10.2" }
JVM version:
> java version "1.8.0_40"
> Java(TM) SE Runtime Environment (build 1.8.0_40-b25)
> Java HotSpot(TM) 64-Bit Server VM (build 25.40-b25, mixed mode)
OS version:
> Linux Ubuntu
Description of the problem including expected versus actual behavior:
When a node leaves the cluster, a lot of shards get stuck in the INITIALIZING state instead of finishing recovery and becoming STARTED as expected.
Steps to reproduce:
1. Disable shard allocation
2. Stop Elasticsearch on one of the nodes
3. Re-enable shard allocation (a sketch of the commands follows this list)
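A minimal sketch of those three steps, using the standard cluster settings API and the Ubuntu service script (host, port, and exact invocation are illustrative, not a verbatim capture):

```sh
# Step 1: disable shard allocation before taking the node down (illustrative host/port)
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'

# Step 2: stop Elasticsearch on the node being taken out
sudo service elasticsearch stop

# Step 3: re-enable allocation; this is when the shards pile up in INITIALIZING
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'
```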
Provide logs (if relevant):
Here is the output of _cat/pending_tasks | head:
> 01439777 56.5m URGENT shard-started ([error-newsflickss][0], node[xirsEXbZSpqldjylVrwzjw], [R], s[INITIALIZING]), reason [after recovery (replica) from node [[30GB_1TB_ComputeNodeNew13][4DtklsTUSd-eRhPG_c3uyw][ip-172-31-37-168][inet[/172.31.37.168:9300]]{master=false}]]
> 1439778 56.5m URGENT shard-started ([firstcrytest-notificationclickedmoe][0], node[xirsEXbZSpqldjylVrwzjw], [R], s[INITIALIZING]), reason [after recovery (replica) from node [[ES_r3_xlarge_1TB_Node_new15][sbwLxGR6RpmFHmHv3Vc6ag][ip-172-31-47-74][inet[/172.31.47.74:9300]]{master=false}]]
> 1439779 56.5m URGENT shard-started ([cleartrip-eamgc][0], node[Nm_bLbaGQWC0u0ofkIjWIw], [R], s[INITIALIZING]), reason [after recovery (replica) from node [[ES_r3_xlarge_1TB_Node_2][xirsEXbZSpqldjylVrwzjw][ip-172-31-41-73][inet[/172.31.41.73:9300]]{master=false}]]
> 1439783 56.5m URGENT shard-started ([emt-uat-fundtransfer][0], node[Nm_bLbaGQWC0u0ofkIjWIw], [R], s[INITIALIZING]), reason [after recovery (replica) from node [[30GB_1TB_ComputeNode11][jpNMJUlwQ7aQKk-hlcWLVQ][ip-172-31-40-207][inet[/172.31.40.207:9300]]{master=false}]]
> 1439822 56.4m URGENT shard-started ([firstcrytest-notificationclickedmoe][0], node[xirsEXbZSpqldjylVrwzjw], [R], s[INITIALIZING]), reason [master [ESMasterNode3][8E9mg0rZSHKITLFvvDDT2g][ip-172-31-46-130][inet[/172.31.46.130:9300]]{data=false, master=true} marked shard as initializing, but shard state is [POST_RECOVERY], mark shard as started]
> 1439780 56.5m URGENT shard-started ([chillr-requestshowqr][0], node[xirsEXbZSpqldjylVrwzjw], [R], s[INITIALIZING]), reason [after recovery (replica) from node [[30GB_1TB_ComputeNode11][jpNMJUlwQ7aQKk-hlcWLVQ][ip-172-31-40-207][inet[/172.31.40.207:9300]]{master=false}]]
> 1439799 56.4m URGENT shard-started ([sdsellerzone-catalogpdpback][0], node[Nm_bLbaGQWC0u0ofkIjWIw], [R], s[INITIALIZING]), reason [master [ESMasterNode3][8E9mg0rZSHKITLFvvDDT2g][ip-172-31-46-130][inet[/172.31.46.130:9300]]{data=false, master=true} marked shard as initializing, but shard state is [POST_RECOVERY], mark shard as started]
> 1439833 56.4m URGENT reroute_after_cluster_update_settings
> 1439824 56.4m URGENT shard-started ([cleartripprod-fbtan][0], node[xirsEXbZSpqldjylVrwzjw], [R], s[INITIALIZING]), reason [master [ESMasterNode3][8E9mg0rZSHKITLFvvDDT2g][ip-172-31-46-130][inet[/172.31.46.130:9300]]{data=false, master=true} marked shard as initializing, but shard state is [POST_RECOVERY], mark shard as started]
> 1439288 5.9h HIGH refresh-mapping [cleartripprod-apppl-2016-06-30][[datapoints]]
_cat/health:
> epoch      timestamp cluster           status node.total node.data shards   pri relo init unassign
> 1468455843 00:24:03  DataPointsCluster yellow         22        19  26619 14017    0   20     1395
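The initializing and unassigned shards counted in the health output can be listed with the cat APIs; a sketch of the kind of commands used to inspect them (host and grep filter are illustrative):

```sh
# List shards that are not STARTED (replicas stuck in INITIALIZING show up here)
curl -s 'http://localhost:9200/_cat/shards' | grep -v STARTED | head

# Per-shard recovery progress
curl -s 'http://localhost:9200/_cat/recovery' | head
```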
_cluster/settings:
{
  "persistent": {
    "cluster": {
      "routing": {
        "allocation": {
          "cluster_concurrent_rebalance": "10",
          "node_concurrent_recoveries": "14",
          "node_initial_primaries_recoveries": "4",
          "enable": "all"
        }
      }
    },
    "threadpool": {
      "bulk": {
        "keep_alive": "2m",
        "size": "16",
        "queue_size": "2000",
        "type": "fixed"
      }
    },
    "indices": {
      "recovery": {
        "concurrent_streams": "6",
        "max_bytes_per_sec": "120mb"
      }
    }
  },
  "transient": {
    "cluster": {
      "routing": {
        "allocation": {
          "node_initial_primaries_recoveries": "10",
          "balance": {
            "index": "0.80f"
          },
          "enable": "all",
          "allow_rebalance": "indices_all_active",
          "cluster_concurrent_rebalance": "0",
          "node_concurrent_recoveries": "5",
          "exclude": {
            "_ip": "172.31.39.58"
          }
        }
      }
    },
    "indices": {
      "recovery": {
        "concurrent_streams": "10"
      }
    }
  }
}
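For completeness, the transient overrides above were applied through the cluster settings API; a minimal sketch using the flat setting keys, with values mirroring the dump (host is illustrative):

```sh
# A sketch of applying the transient recovery/allocation overrides shown above
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": "5",
    "cluster.routing.allocation.cluster_concurrent_rebalance": "0",
    "cluster.routing.allocation.exclude._ip": "172.31.39.58",
    "indices.recovery.concurrent_streams": "10"
  }
}'
```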
Is this happening because the cluster has too many shards (26619 shards across 19 data nodes)?