Shards unassigned for .kibana_task_manager index in cluster

Hi,

We are running a 3-node cluster with a production workload and are having a problem with the .kibana_task_manager index being in a red state. I ran the cluster allocation explain API and found that 2 of the shards are unassigned; here is the response.

As a workaround the index can be deleted, and it gets recreated when Kibana is restarted, but I am looking for a permanent solution so that this does not reoccur in the future.
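
A sketch of that workaround in Console syntax (the index name is the one shown in the allocation output below; if .kibana_task_manager is only an alias for a versioned index such as .kibana_task_manager_2, as in the settings later in this thread, the concrete index name would be needed instead):

# delete the red task manager index, then restart Kibana so it recreates it
> DELETE /.kibana_task_manager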

I am trying to reproduce this error in a lower environment, but we have been unable to do so. Can someone help me in this regard?
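
For reference, a request along these lines (Console syntax) is what produced the output below; the index, shard and primary values match the response:

> GET _cluster/allocation/explain
{
  "index": ".kibana_task_manager",
  "shard": 0,
  "primary": true
}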

{
  "index":".kibana_task_manager",
  "shard":0,
  "primary":true,
  "current_state":"unassigned",
  "unassigned_info":{
    "reason":"NODE_LEFT",
    "at":"2020-11-01T21:00:43.758Z",
    "details":"node_left [pyMkwKrJR0-XXhVhR1D8Sw]",
    "last_allocation_status":"no_valid_shard_copy"
  },
  "can_allocate":"no_valid_shard_copy",
  "allocate_explanation":"cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster",
  "node_allocation_decisions":[
    {
      "node_id":"prod3",
      "node_name":"prod3",
      "transport_address":"...:9300",
      "node_attributes":{
        "ml.machine_memory":"16170143744",
        "xpack.installed":"true",
        "ml.max_open_jobs":"20",
        "ml.enabled":"true"
      },
      "node_decision":"no",
      "store":{
        "found":false
      }
    },
    {
      "node_id":"prod1",
      "node_name":"prod1",
      "transport_address":"...:9300",
      "node_attributes":{
        "ml.machine_memory":"16346312704",
        "ml.max_open_jobs":"20",
        "xpack.installed":"true",
        "ml.enabled":"true"
      },
      "node_decision":"no",
      "store":{
        "found":false
      }
    },
    {
      "node_id":"prod2",
      "node_name":"prod2",
      "transport_address":"...:9300",
      "node_attributes":{
        "ml.machine_memory":"16170151936",
        "ml.max_open_jobs":"20",
        "xpack.installed":"true",
        "ml.enabled":"true"
      },
      "node_decision":"no",
      "store":{
        "found":false
      }
    }
  ]
}

Hi, when an Elasticsearch cluster has unassigned primary shards, it will go into the "red" state to tell you that something is wrong.
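
For example, the cluster health API reports that overall status along with a count of unassigned shards:

> GET _cluster/health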

Use the _cat/shards API to find which shards are unassigned and why. In your case, the "why" is: cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster. That means the cluster is aware that the shard existed, but now it doesn't exist. It is not normal for shards to suddenly disappear, which is probably why you are not able to reproduce the issue easily. One way to reproduce it would be:

  1. Start a cluster with 3 nodes
  2. Create an index with no replica shards and 1 document
  3. Find the node that has the primary shard for the index (see the Console sketch after this list)
  4. Disconnect that node (cluster will go red).
  5. Re-do step 2 using the same index name
  6. If you bring the offline node back up, clear the data directory first. That ensures the shard is gone, but the reference is still in the ES cluster state.
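
Here is a minimal Console sketch of steps 2 and 3, using a throwaway index called test-red (the name and the single document are only placeholders):

# step 2: one primary shard, no replicas, and a single document
> PUT /test-red
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}

> PUT /test-red/_doc/1
{
  "message": "single document so the shard holds data"
}

# step 3: see which node holds the primary shard
> GET _cat/shards/test-red?v&h=index,shard,prirep,state,node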

The cluster is aware that there was a shard for the index that used to exist and now it doesn't. That is enough for the cluster state to go to "red." But when you try to re-allocate a shard for the same index, ES won't simply allocate it, as that would make it impossible for you to recover the data from a backup.
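
(If you decide the data really is expendable, the explicit way past that safeguard is a reroute command that allocates an empty primary and makes you acknowledge the data loss. A sketch, reusing one of the node names from the output above:)

# forces an empty primary shard onto prod1 and accepts that its data is lost
> POST _cluster/reroute
{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": ".kibana_task_manager",
        "shard": 0,
        "node": "prod1",
        "accept_data_loss": true
      }
    }
  ]
}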

The .kibana_task_manager index is configured to use a single shard, so the cluster can be green even when it has only a single node. It also has the auto_expand_replicas setting set to 0-1, so that a replica shard will be assigned if a second node is available.

> GET /.kibana_task_manager/_settings
{
  ".kibana_task_manager_2" : {
    "settings" : {
      "index" : {
        "number_of_shards" : "1",
        "auto_expand_replicas" : "0-1",
        "provided_name" : ".kibana_task_manager_2",
        "creation_date" : "1603997355585",
        "number_of_replicas" : "0",
        "uuid" : "QDXEqYv9Tb2bevroJ9l8lg",
        "version" : {
          "created" : "7090199",
          "upgraded" : "7090399"
        }
      }
    }
  }
}

The way to reproduce this problem with auto_expand_replicas is to:

  1. Start a cluster with 3 nodes
  2. Allow the .kibana_task_manager's primary and replica shards to allocate
  3. Find BOTH nodes that have the shards for the index (see the _cat/shards call after this list)
  4. Disconnect both nodes.
  5. Restart Kibana
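
For step 3, a call like the following shows which nodes currently hold the primary and the replica:

> GET _cat/shards/.kibana_task_manager*?v&h=index,shard,prirep,state,node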

Thanks for the reply!
I did what you suggested to reproduce the .kibana_task_manager index going red, but it did not work out. Here are the steps I followed.

  1. Started the cluster with 3 nodes: dev-1, dev-2, dev-3.
  2. .kibana_task_manager was initially allocated on 2 nodes: dev-1 (primary) and dev-2 (replica).
  3. Disconnected dev-1 and dev-2 from the cluster (killed the ES processes). At this point I cannot run any commands, because I get a master-not-discovered exception: a 3-node cluster needs at least 2 nodes up and running to elect a master.
  4. Restarted the Kibana server.
  5. Started the dev-2 ES server.

Result: .kibana_task_manager was automatically allocated to the remaining nodes once any 2 nodes were up, instead of the index going red.
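
(One way to double-check that is the per-index health API; once two nodes are back up the status should come back yellow or green rather than red:)

> GET _cluster/health/.kibana_task_manager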

Kindly let me know if I am doing something wrong. Thanks.

As Tim mentioned, the root cause appears to be that your cluster has nodes which go offline. If all copies of a shard disappear, Elasticsearch will be unable to recover it.

I'd suggest you investigate why the primary shard and its replica are both going offline, and try to prevent that from happening.
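
A quick first check (just a sketch; adapt it to your own node names) is whether any node has been restarting or is under memory pressure; a very low uptime would point at a node that keeps going offline:

> GET _cat/nodes?v&h=name,uptime,heap.percent,ram.percent,node.role,master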
