We have a 5-node ES 2.3.2 cluster that has been running for several months, with about 1600 shards and 500 GB of data in total.
We just noticed that the cluster status is red because one index has an unassigned shard; the cluster also has one pending task.
Shards of this index, which is configured with 5 shards and no replicas:
online_user_20160926 1 p STARTED 35838300 12.6gb 10.10.35.16 node-2
online_user_20160926 4 p STARTED 35868944 13.5gb 10.10.35.14 node-0
online_user_20160926 2 p STARTED 35840639 12.6gb 10.10.35.15 node-1
online_user_20160926 3 p UNASSIGNED
online_user_20160926 0 p STARTED 35833339 13.7gb 10.10.8.78 node-4
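For anyone who wants to pick the problem shard out of a longer listing, here is a minimal Python sketch. It is my own helper, not part of ES; the function name `non_started_shards` is hypothetical, and it assumes the column order shown above (index, shard, primary/replica flag, state, ...).

```python
# Rows copied from the shard listing above.
CAT_SHARDS = """\
online_user_20160926 1 p STARTED 35838300 12.6gb 10.10.35.16 node-2
online_user_20160926 4 p STARTED 35868944 13.5gb 10.10.35.14 node-0
online_user_20160926 2 p STARTED 35840639 12.6gb 10.10.35.15 node-1
online_user_20160926 3 p UNASSIGNED
online_user_20160926 0 p STARTED 35833339 13.7gb 10.10.8.78 node-4
"""


def non_started_shards(cat_output):
    """Return (index, shard, prirep, state) for every row not in STARTED state.

    Assumes whitespace-separated columns in the order:
    index, shard number, p/r flag, state, then optional docs/size/ip/node.
    """
    result = []
    for line in cat_output.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        index, shard, prirep, state = fields[:4]
        if state != "STARTED":
            result.append((index, int(shard), prirep, state))
    return result


if __name__ == "__main__":
    for row in non_started_shards(CAT_SHARDS):
        print(row)
```

On the listing above this reports only shard 3, the unassigned primary.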
This index is read-only right now, and its search rate is about 200 qps.
Pending task:
{
  "tasks": [
    {
      "insert_order": 37729,
      "priority": "HIGH",
      "source": "cluster_reroute(async_shard_fetch)",
      "executing": true,
      "time_in_queue_millis": 1751,
      "time_in_queue": "1.7s"
    }
  ]
}
It seems the task keeps being picked up, executed, and then re-added to the task queue. What could possibly cause this? >_<
P.S.
Shard allocation is enabled:
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}
_cluster/health:
{
  "cluster_name": "profiling",
  "status": "red",
  "timed_out": false,
  "number_of_nodes": 5,
  "number_of_data_nodes": 5,
  "active_primary_shards": 909,
  "active_shards": 1649,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 1,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 1,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 99.93939393939394
}
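As a sanity check that exactly one shard out of the whole cluster is missing, the percentage in the health output can be reproduced from the counts. This is a rough Python sketch; the formula (active divided by active plus initializing plus unassigned) is my assumption about how the number is derived, not something taken from the ES source, and `active_percent` is a hypothetical helper name.

```python
# Counts copied from the _cluster/health output above.
health = {
    "active_shards": 1649,
    "initializing_shards": 0,
    "unassigned_shards": 1,
}


def active_percent(h):
    """Percentage of shards that are active.

    Assumption: total = active + initializing + unassigned
    (relocating shards are treated as already counted in active).
    """
    total = h["active_shards"] + h["initializing_shards"] + h["unassigned_shards"]
    return 100.0 * h["active_shards"] / total


if __name__ == "__main__":
    print(active_percent(health))
```

With these counts the total is 1650 shards, so the single unassigned primary accounts for the whole gap from 100%.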
Related logs from node-0, which is stuck in a loop of "failed to create shard":
[2016-10-10 14:09:09,697][DEBUG][cluster.service ] [node-0] processing [shard-failed ([online_user_20160926][3], node[cuynJunpRHWQw_ztXJNNig], [P], v[6], s[INITIALIZING], a[id=0_AWIRpNSJanQOZSz4dqtQ], unassigned_info[[reason=ALLOCATION_FAILED], at[2016-10-10T06:09:03.205Z], details[failed to create shard, failure ElasticsearchException[failed to create shard]; nested: LockObtainFailedException[Can't lock shard [online_user_20160926][3], timed out after 5000ms]; ]]), message [failed to create shard]]: took 571ms done applying updated cluster_state (version: 28339, uuid: CKmRtltvREOn8oLT0F2Fwg)
[2016-10-10 14:09:09,699][DEBUG][cluster.service ] [node-0] processing [cluster_reroute(async_shard_fetch)]: execute
[2016-10-10 14:09:09,701][DEBUG][gateway ] [node-0] [online_user_20160926][3] found 0 allocations of [online_user_20160926][3], node[null], [P], v[7], s[UNASSIGNED], unassigned_info[[reason=ALLOCATION_FAILED], at[2016-10-10T06:09:09.127Z], details[failed to create shard, failure ElasticsearchException[failed to create shard]; nested: LockObtainFailedException[Can't lock shard [online_user_20160926][3], timed out after 5000ms]; ]], highest version: [-1]
[2016-10-10 14:09:09,701][DEBUG][gateway ] [node-0] [online_user_20160926][3]: not allocating, number_of_allocated_shards_found [0]
[2016-10-10 14:09:09,721][DEBUG][cluster.service ] [node-0] processing [cluster_reroute(async_shard_fetch)]: took 21ms no change in cluster_state
[2016-10-10 14:09:09,721][DEBUG][cluster.service ] [node-0] processing [cluster_reroute(async_shard_fetch)]: execute
[2016-10-10 14:09:09,723][DEBUG][gateway ] [node-0] [online_user_20160926][3] found 1 allocations of [online_user_20160926][3], node[null], [P], v[7], s[UNASSIGNED], unassigned_info[[reason=ALLOCATION_FAILED], at[2016-10-10T06:09:09.127Z], details[failed to create shard, failure ElasticsearchException[failed to create shard]; nested: LockObtainFailedException[Can't lock shard [online_user_20160926][3], timed out after 5000ms]; ]], highest version: [5]
[2016-10-10 14:09:09,723][DEBUG][gateway ] [node-0] [online_user_20160926][3]: allocating [[online_user_20160926][3], node[null], [P], v[7], s[UNASSIGNED], unassigned_info[[reason=ALLOCATION_FAILED], at[2016-10-10T06:09:09.127Z], details[failed to create shard, failure ElasticsearchException[failed to create shard]; nested: LockObtainFailedException[Can't lock shard [online_user_20160926][3], timed out after 5000ms]; ]]] to [{node-3}{cuynJunpRHWQw_ztXJNNig}{10.10.8.77}{10.10.8.77:9300}] on primary allocation
[2016-10-10 14:09:09,748][DEBUG][cluster.service ] [node-0] cluster state updated, version [28340], source [cluster_reroute(async_shard_fetch)]
[2016-10-10 14:09:09,748][DEBUG][cluster.service ] [node-0] publishing cluster state version [28340]