Long time pending cluster_reroute (async_shard_fetch) task

mint · October 10, 2016, 4:49am

We have a 5-node ES 2.3.2 cluster that has been running for several months, with about 1600 shards and 500 GB of data in total.

We recently noticed that the cluster status is red because one index has an unassigned shard. The cluster also has one pending task.

Shards of this index, which is configured with 5 primary shards and no replicas:

online_user_20160926         1 p STARTED    35838300  12.6gb 10.10.35.16 node-2 
online_user_20160926         4 p STARTED    35868944  13.5gb 10.10.35.14 node-0 
online_user_20160926         2 p STARTED    35840639  12.6gb 10.10.35.15 node-1 
online_user_20160926         3 p UNASSIGNED                                     
online_user_20160926         0 p STARTED    35833339  13.7gb 10.10.8.78  node-4
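A minimal sketch for picking the unassigned shards out of a listing like the one above. It assumes the text is `_cat/shards` output with the default column order (index, shard, prirep, state, ...); the helper name `find_unassigned` is made up for illustration.

```python
# Sample copied from the listing above (assumed to be _cat/shards output).
CAT_SHARDS = """\
online_user_20160926         1 p STARTED    35838300  12.6gb 10.10.35.16 node-2
online_user_20160926         4 p STARTED    35868944  13.5gb 10.10.35.14 node-0
online_user_20160926         2 p STARTED    35840639  12.6gb 10.10.35.15 node-1
online_user_20160926         3 p UNASSIGNED
online_user_20160926         0 p STARTED    35833339  13.7gb 10.10.8.78  node-4
"""


def find_unassigned(cat_output):
    """Return (index, shard, prirep) for every row whose state is UNASSIGNED."""
    unassigned = []
    for line in cat_output.splitlines():
        fields = line.split()
        # Unassigned rows carry only the first four columns.
        if len(fields) >= 4 and fields[3] == "UNASSIGNED":
            unassigned.append((fields[0], int(fields[1]), fields[2]))
    return unassigned


for index, shard, prirep in find_unassigned(CAT_SHARDS):
    print(index, shard, prirep)
```

Since the index has no replicas, an unassigned primary here means there is no other copy to promote, which is why the cluster goes red rather than yellow.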

This index is read-only right now, and the search rate is about 200 qps.

Pending task:

{
  "tasks": [
    {
      "insert_order": 37729,
      "priority": "HIGH",
      "source": "cluster_reroute(async_shard_fetch)",
      "executing": true,
      "time_in_queue_millis": 1751,
      "time_in_queue": "1.7s"
    }
  ]
}

It seems like the task is executed and then re-added to the task queue over and over. What could be causing this? >_<

P.S. Shard allocation is enabled:
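One way to test the "re-added" theory is to compare two `_cluster/pending_tasks` responses taken a few seconds apart. This is only a sketch: it assumes `insert_order` identifies a task for as long as it stays in the queue, so a task with the same source but a new `insert_order` was resubmitted rather than stuck. The second snapshot below is hypothetical (its `insert_order` is invented), and `classify_pending` is an illustrative name.

```python
import json

# First snapshot: the pending_tasks response shown above.
FIRST = json.loads("""
{"tasks": [{"insert_order": 37729, "priority": "HIGH",
            "source": "cluster_reroute(async_shard_fetch)",
            "executing": true, "time_in_queue_millis": 1751,
            "time_in_queue": "1.7s"}]}
""")

# Second snapshot: HYPOTHETICAL values, for illustration only.
SECOND = json.loads("""
{"tasks": [{"insert_order": 37741, "priority": "HIGH",
            "source": "cluster_reroute(async_shard_fetch)",
            "executing": true, "time_in_queue_millis": 902,
            "time_in_queue": "902ms"}]}
""")


def classify_pending(before, after):
    """Split tasks in `after` into stuck (same insert_order) and resubmitted."""
    old = {t["insert_order"]: t["source"] for t in before["tasks"]}
    new = {t["insert_order"]: t["source"] for t in after["tasks"]}
    old_sources = set(old.values())
    stuck = sorted(set(old) & set(new))
    resubmitted = sorted(
        order for order, source in new.items()
        if order not in old and source in old_sources
    )
    return {"stuck": stuck, "resubmitted": resubmitted}


print(classify_pending(FIRST, SECOND))
```

A short `time_in_queue` on every poll (1.7s in the response above) would point the same way: each task is short-lived, and a new one keeps taking its place.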

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}

_cluster/health:

{
  "cluster_name": "profiling",
  "status": "red",
  "timed_out": false,
  "number_of_nodes": 5,
  "number_of_data_nodes": 5,
  "active_primary_shards": 909,
  "active_shards": 1649,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 1,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 1,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 99.93939393939394
}

Related logs from node-0, which are stuck in a loop of "failed to create shard":

[2016-10-10 14:09:09,697][DEBUG][cluster.service ] [node-0] processing [shard-failed ([online_user_20160926][3], node[cuynJunpRHWQw_ztXJNNig], [P], v[6], s[INITIALIZING], a[id=0_AWIRpNSJanQOZSz4dqtQ], unassigned_info[[reason=ALLOCATION_FAILED], at[2016-10-10T06:09:03.205Z], details[failed to create shard, failure ElasticsearchException[failed to create shard]; nested: LockObtainFailedException[Can't lock shard [online_user_20160926][3], timed out after 5000ms]; ]]), message [failed to create shard]]: took 571ms done applying updated cluster_state (version: 28339, uuid: CKmRtltvREOn8oLT0F2Fwg)
[2016-10-10 14:09:09,699][DEBUG][cluster.service ] [node-0] processing [cluster_reroute(async_shard_fetch)]: execute
[2016-10-10 14:09:09,701][DEBUG][gateway ] [node-0] [online_user_20160926][3] found 0 allocations of [online_user_20160926][3], node[null], [P], v[7], s[UNASSIGNED], unassigned_info[[reason=ALLOCATION_FAILED], at[2016-10-10T06:09:09.127Z], details[failed to create shard, failure ElasticsearchException[failed to create shard]; nested: LockObtainFailedException[Can't lock shard [online_user_20160926][3], timed out after 5000ms]; ]], highest version: [-1]
[2016-10-10 14:09:09,701][DEBUG][gateway ] [node-0] [online_user_20160926][3]: not allocating, number_of_allocated_shards_found [0]
[2016-10-10 14:09:09,721][DEBUG][cluster.service ] [node-0] processing [cluster_reroute(async_shard_fetch)]: took 21ms no change in cluster_state
[2016-10-10 14:09:09,721][DEBUG][cluster.service ] [node-0] processing [cluster_reroute(async_shard_fetch)]: execute
[2016-10-10 14:09:09,723][DEBUG][gateway ] [node-0] [online_user_20160926][3] found 1 allocations of [online_user_20160926][3], node[null], [P], v[7], s[UNASSIGNED], unassigned_info[[reason=ALLOCATION_FAILED], at[2016-10-10T06:09:09.127Z], details[failed to create shard, failure ElasticsearchException[failed to create shard]; nested: LockObtainFailedException[Can't lock shard [online_user_20160926][3], timed out after 5000ms]; ]], highest version: [5]
[2016-10-10 14:09:09,723][DEBUG][gateway ] [node-0] [online_user_20160926][3]: allocating [[online_user_20160926][3], node[null], [P], v[7], s[UNASSIGNED], unassigned_info[[reason=ALLOCATION_FAILED], at[2016-10-10T06:09:09.127Z], details[failed to create shard, failure ElasticsearchException[failed to create shard]; nested: LockObtainFailedException[Can't lock shard [online_user_20160926][3], timed out after 5000ms]; ]]] to [{node-3}{cuynJunpRHWQw_ztXJNNig}{10.10.8.77}{10.10.8.77:9300}] on primary allocation
[2016-10-10 14:09:09,748][DEBUG][cluster.service ] [node-0] cluster state updated, version [28340], source [cluster_reroute(async_shard_fetch)]
[2016-10-10 14:09:09,748][DEBUG][cluster.service ] [node-0] publishing cluster state version [28340]
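To make the loop easier to see, here is a rough sketch that scans DEBUG log text for the lock failure and for the node each allocation attempt targets. The regular expressions are written against the two log entries copied below and may not match other Elasticsearch versions; `summarize` is an illustrative name.

```python
import re
from collections import Counter

# Two entries copied verbatim from the logs above.
LOG = (
    "[2016-10-10 14:09:09,701][DEBUG][gateway ] [node-0] [online_user_20160926][3] found 0 allocations of [online_user_20160926][3], node[null], [P], v[7], s[UNASSIGNED], unassigned_info[[reason=ALLOCATION_FAILED], at[2016-10-10T06:09:09.127Z], details[failed to create shard, failure ElasticsearchException[failed to create shard]; nested: LockObtainFailedException[Can't lock shard [online_user_20160926][3], timed out after 5000ms]; ]], highest version: [-1]\n"
    "[2016-10-10 14:09:09,723][DEBUG][gateway ] [node-0] [online_user_20160926][3]: allocating [[online_user_20160926][3], node[null], [P], v[7], s[UNASSIGNED], unassigned_info[[reason=ALLOCATION_FAILED], at[2016-10-10T06:09:09.127Z], details[failed to create shard, failure ElasticsearchException[failed to create shard]; nested: LockObtainFailedException[Can't lock shard [online_user_20160926][3], timed out after 5000ms]; ]]] to [{node-3}{cuynJunpRHWQw_ztXJNNig}{10.10.8.77}{10.10.8.77:9300}] on primary allocation\n"
)

# Captures index name, shard number and lock timeout in milliseconds.
LOCK_RE = re.compile(
    r"LockObtainFailedException\[Can't lock shard "
    r"\[([^\]]+)\]\[(\d+)\], timed out after (\d+)ms\]"
)
# Captures node name and host from "... to [{name}{id}{host}{address}]".
TARGET_RE = re.compile(r"\] to \[\{([^}]+)\}\{[^}]+\}\{([^}]+)\}")


def summarize(log_text):
    """Count lock-failure mentions per shard and allocation attempts per node."""
    lock_failures = Counter(
        (index, int(shard)) for index, shard, _ in LOCK_RE.findall(log_text)
    )
    targets = Counter(TARGET_RE.findall(log_text))
    return lock_failures, targets


lock_failures, targets = summarize(LOG)
print(lock_failures)
print(targets)
```

In the logs above, the node ID in the shard-failed entry (cuynJunpRHWQw_ztXJNNig) is the same ID as node-3, the node the shard is allocated to again, so the same node appears to be failing to take the shard lock on every attempt.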
