Health check on my ES cluster shows 99.7%

After unsuccessfully trying to back up my cluster, I bounced all my nodes and now I see the status below. Any ideas? Events are still coming in, but I'm concerned.

1551228466 16:47:46 istunixes yellow 9 5 2087 1047 0 0 7 0 - 99.7%

Is the cluster still in this state? Are there ongoing recoveries? When you restart nodes, it can take some time to recover all the shards. If there are no ongoing recoveries and yet some shards are still not allocated, check the cluster allocation explain API.
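For reference, that status line looks like the default GET _cat/health output (adding ?v gives you a header row), along these lines:

```
GET _cat/health?v

epoch      timestamp cluster   status node.total node.data shards  pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1551228466 16:47:46  istunixes yellow          9         5   2087 1047    0    0        7             0                  - 99.7%
```

So the unassign column says 7 shards are unassigned, which is where the 99.7% comes from. You can watch any ongoing recoveries with GET _cat/recovery?active_only=true.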

Hi David, the cluster is still in the same state. I read over the docs in the link you sent, but I don't see the API response for my condition. I have included it below; it sounds like it may be an easy fix. In the end, if I have to wipe that index, that is okay too. See the output of the GET /_cluster/allocation/explain API:

{
  "index": "syslog-rsyslog-2019-02-27",
  "shard": 0,
  "primary": true,
  "current_state": "started",
  "current_node": {
    "id": "AAtmertMTeiuE_i5CQTKAA",
    "name": "ist000248",
    "transport_address": "10.44.0.180:9300",
    "weight_ranking": 4
  },
  "can_remain_on_current_node": "yes",
  "can_rebalance_cluster": "no",
  "can_rebalance_cluster_decisions": [
    {
      "decider": "cluster_rebalance",
      "decision": "NO",
      "explanation": "the cluster has unassigned shards and cluster setting [cluster.routing.allocation.allow_rebalance] is set to [indices_all_active]"
    }
  ],
  "can_rebalance_to_other_node": "no",
  "rebalance_explanation": "rebalancing is not allowed, even though there is at least one node on which the shard can be allocated",
  "node_allocation_decisions": [
    {
      "node_id": "YQXjt41cQrK7vDHuNXubZA",
      "node_name": "ist000242",
      "transport_address": "10.44.3.8:9300",
      "node_decision": "yes",
      "weight_ranking": 1
    },
    {
      "node_id": "ihHNmtfqTGSugWowHxCILw",
      "node_name": "ist000243",
      "transport_address": "10.44.2.206:9300",
      "node_decision": "yes",
      "weight_ranking": 2
    },
    {
      "node_id": "H8dbQ6dnQtSIOadYeqGz6w",
      "node_name": "ist000245",
      "transport_address": "10.44.0.157:9300",
      "node_decision": "yes",
      "weight_ranking": 3
    },
    {
      "node_id": "fCfm_WQXTgeG52ca37Th1A",
      "node_name": "ist000247",
      "transport_address": "10.44.0.188:9300",
      "node_decision": "no",
      "weight_ranking": 4,
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "the shard cannot be allocated to the same node on which a copy of the shard already exists [[syslog-rsyslog-2019-02-27][0], node[fCfm_WQXTgeG52ca37Th1A], [R], s[STARTED], a[id=p4ECJ365RV2gAzmvJWbp6g]]"
        }
      ]
    }
  ]
}

Posting my current cluster settings:

{
  "persistent": {
    "cluster": {
      "routing": {
        "allocation": {
          "cluster_concurrent_rebalance": "4",
          "node_concurrent_recoveries": "4",
          "disk": {
            "watermark": {
              "low": "90%",
              "high": "95%"
            }
          },
          "enable": "all"
        }
      }
    },
    "logger": {
      "_root": "INFO"
    }
  },
  "transient": {}
}

Hmm, that's strange. I thought GET _cluster/allocation/explain only shows you information about an unassigned shard, but "current_state": "started" tells us that this shard is fine.

You will need to ask it a more specific question. Find an unassigned shard with GET _cat/shards and then use this form of the allocation explain API:

GET /_cluster/allocation/explain
{
  "index": "myindex",
  "shard": 0,
  "primary": true
}
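
For instance, to list the shards along with the reason any unassigned ones are unassigned, you can ask _cat/shards for specific columns, something like:

```
GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state
```

(h selects the columns and s sorts the output, so the UNASSIGNED rows group together.)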

Looks like I have 7 unassigned shards. Running the explain API on one of them shows the output below. I notice that one of my data nodes is not in that list; I should have 5 but only see 4. Any ideas?

{
  "index": "syslog-rsyslog-2019-02-25",
  "shard": 3,
  "primary": true,
  "current_state": "started",
  "current_node": {
    "id": "fCfm_WQXTgeG52ca37Th1A",
    "name": "ist000247",
    "transport_address": "10.44.0.188:9300",
    "weight_ranking": 2
  },
  "can_remain_on_current_node": "yes",
  "can_rebalance_cluster": "no",
  "can_rebalance_cluster_decisions": [
    {
      "decider": "rebalance_only_when_active",
      "decision": "NO",
      "explanation": "rebalancing is not allowed until all replicas in the cluster are active"
    },
    {
      "decider": "cluster_rebalance",
      "decision": "NO",
      "explanation": "the cluster has unassigned shards and cluster setting [cluster.routing.allocation.allow_rebalance] is set to [indices_all_active]"
    }
  ],
  "can_rebalance_to_other_node": "no",
  "rebalance_explanation": "rebalancing is not allowed, even though there is at least one node on which the shard can be allocated",
  "node_allocation_decisions": [
    {
      "node_id": "AAtmertMTeiuE_i5CQTKAA",
      "node_name": "ist000248",
      "transport_address": "10.44.0.180:9300",
      "node_decision": "yes",
      "weight_ranking": 1
    },
    {
      "node_id": "H8dbQ6dnQtSIOadYeqGz6w",
      "node_name": "ist000245",
      "transport_address": "10.44.0.157:9300",
      "node_decision": "worse_balance",
      "weight_ranking": 2
    },
    {
      "node_id": "YQXjt41cQrK7vDHuNXubZA",
      "node_name": "ist000242",
      "transport_address": "10.44.3.8:9300",
      "node_decision": "worse_balance",
      "weight_ranking": 2
    },
    {
      "node_id": "ihHNmtfqTGSugWowHxCILw",
      "node_name": "ist000243",
      "transport_address": "10.44.2.206:9300",
      "node_decision": "worse_balance",
      "weight_ranking": 3
    }
  ]
}

It looks like this shard has a correctly allocated primary but is perhaps missing a replica? If so, you need to call the allocation explain API with "primary": false to find out about the shards other than the primary.
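
In this case that would be something like:

```
GET /_cluster/allocation/explain
{
  "index": "syslog-rsyslog-2019-02-25",
  "shard": 3,
  "primary": false
}
```

The deciders in that response should tell you why the replica copy cannot be assigned.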

David, much obliged for your help. I was able to fix my cluster by doing the following steps:

  1. Set replicas in the bad indices to 0
  2. PUT _cluster/settings
     {
       "persistent": {
         "cluster.routing.allocation.enable": "none"
       }
     }
  3. POST _flush/synced
  4. Stop ES on the master and data nodes
  5. Start ES on the master nodes first
  6. Start ES on the data nodes second
  7. Set "cluster.routing.allocation.enable" back to "all"
  8. Set replicas in the bad indices back to 1
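
For anyone finding this later, steps 1, 2, 7, and 8 can be done with requests along these lines (the index name here is just one of the ones from this thread):

```
PUT syslog-rsyslog-2019-02-25/_settings
{ "index": { "number_of_replicas": 0 } }

PUT _cluster/settings
{ "persistent": { "cluster.routing.allocation.enable": "none" } }

POST _flush/synced

PUT _cluster/settings
{ "persistent": { "cluster.routing.allocation.enable": "all" } }

PUT syslog-rsyslog-2019-02-25/_settings
{ "index": { "number_of_replicas": 1 } }
```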

I'm back to green!
