Health check on my ES cluster shows 99.7%

After unsuccessfully trying to back up my cluster, I bounced all my nodes and now I see the status below. Any ideas? Events are still coming in, but I'm concerned.

1551228466 16:47:46 istunixes yellow 9 5 2087 1047 0 0 7 0 - 99.7%

Is the cluster still in this state? Are there ongoing recoveries? When you restart nodes, it can take some time to recover all the shards. If there are no ongoing recoveries and yet some shards are still not allocated, check the cluster allocation explain API.
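For reference, that status line looks like the default GET _cat/health output (adding ?v gives you a header row), along these lines:

```
GET _cat/health?v

epoch      timestamp cluster   status node.total node.data shards  pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1551228466 16:47:46  istunixes yellow          9         5   2087 1047    0    0        7             0                  - 99.7%
```

So the unassign column says 7 shards are unassigned, which is where the 99.7% comes from. You can watch any ongoing recoveries with GET _cat/recovery?active_only=true.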

Hi David, the cluster is still in the same state. I read over the docs in the link you sent, but I don't see the API response for my condition. I have included it below; it sounds like it may be an easy fix. In the end, if I have to wipe that index, that is okay too. See the output of the GET /_cluster/allocation/explain API:

{
  "index": "syslog-rsyslog-2019-02-27",
  "shard": 0,
  "primary": true,
  "current_state": "started",
  "current_node": {
    "id": "AAtmertMTeiuE_i5CQTKAA",
    "name": "ist000248",
    "transport_address": "10.44.0.180:9300",
    "weight_ranking": 4
  },
  "can_remain_on_current_node": "yes",
  "can_rebalance_cluster": "no",
  "can_rebalance_cluster_decisions": [
    {
      "decider": "cluster_rebalance",
      "decision": "NO",
      "explanation": "the cluster has unassigned shards and cluster setting [cluster.routing.allocation.allow_rebalance] is set to [indices_all_active]"
    }
  ],
  "can_rebalance_to_other_node": "no",
  "rebalance_explanation": "rebalancing is not allowed, even though there is at least one node on which the shard can be allocated",
  "node_allocation_decisions": [
    {
      "node_id": "YQXjt41cQrK7vDHuNXubZA",
      "node_name": "ist000242",
      "transport_address": "10.44.3.8:9300",
      "node_decision": "yes",
      "weight_ranking": 1
    },
    {
      "node_id": "ihHNmtfqTGSugWowHxCILw",
      "node_name": "ist000243",
      "transport_address": "10.44.2.206:9300",
      "node_decision": "yes",
      "weight_ranking": 2
    },
    {
      "node_id": "H8dbQ6dnQtSIOadYeqGz6w",
      "node_name": "ist000245",
      "transport_address": "10.44.0.157:9300",
      "node_decision": "yes",
      "weight_ranking": 3
    },
    {
      "node_id": "fCfm_WQXTgeG52ca37Th1A",
      "node_name": "ist000247",
      "transport_address": "10.44.0.188:9300",
      "node_decision": "no",
      "weight_ranking": 4,
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "the shard cannot be allocated to the same node on which a copy of the shard already exists [[syslog-rsyslog-2019-02-27][0], node[fCfm_WQXTgeG52ca37Th1A], [R], s[STARTED], a[id=p4ECJ365RV2gAzmvJWbp6g]]"
        }
      ]
    }
  ]
}

Posting my current cluster settings:

{
  "persistent": {
    "cluster": {
      "routing": {
        "allocation": {
          "cluster_concurrent_rebalance": "4",
          "node_concurrent_recoveries": "4",
          "disk": {
            "watermark": {
              "low": "90%",
              "high": "95%"
            }
          },
          "enable": "all"
        }
      }
    },
    "logger": {
      "_root": "INFO"
    }
  },
  "transient": {}
}

Hmm, that's strange. I thought GET _cluster/allocation/explain only shows you information about an unassigned shard, but "current_state": "started" tells us that this shard is fine.

You will need to ask it a more specific question. Find an unassigned shard with GET _cat/shards and then use this form of the allocation explain API:

GET /_cluster/allocation/explain
{
  "index": "myindex",
  "shard": 0,
  "primary": true
}
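
For instance, to list the shards along with the reason any unassigned ones are unassigned, you can ask _cat/shards for specific columns, something like:

```
GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state
```

(h selects the columns and s sorts the output, so the UNASSIGNED rows group together.)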

Looks like I have 7 unassigned shards. Running the explain API on one of them shows the output below. I notice that one of my data nodes is not in that list; I should have 5 but only see 4. Any ideas?

{
  "index": "syslog-rsyslog-2019-02-25",
  "shard": 3,
  "primary": true,
  "current_state": "started",
  "current_node": {
    "id": "fCfm_WQXTgeG52ca37Th1A",
    "name": "ist000247",
    "transport_address": "10.44.0.188:9300",
    "weight_ranking": 2
  },
  "can_remain_on_current_node": "yes",
  "can_rebalance_cluster": "no",
  "can_rebalance_cluster_decisions": [
    {
      "decider": "rebalance_only_when_active",
      "decision": "NO",
      "explanation": "rebalancing is not allowed until all replicas in the cluster are active"
    },
    {
      "decider": "cluster_rebalance",
      "decision": "NO",
      "explanation": "the cluster has unassigned shards and cluster setting [cluster.routing.allocation.allow_rebalance] is set to [indices_all_active]"
    }
  ],
  "can_rebalance_to_other_node": "no",
  "rebalance_explanation": "rebalancing is not allowed, even though there is at least one node on which the shard can be allocated",
  "node_allocation_decisions": [
    {
      "node_id": "AAtmertMTeiuE_i5CQTKAA",
      "node_name": "ist000248",
      "transport_address": "10.44.0.180:9300",
      "node_decision": "yes",
      "weight_ranking": 1
    },
    {
      "node_id": "H8dbQ6dnQtSIOadYeqGz6w",
      "node_name": "ist000245",
      "transport_address": "10.44.0.157:9300",
      "node_decision": "worse_balance",
      "weight_ranking": 2
    },
    {
      "node_id": "YQXjt41cQrK7vDHuNXubZA",
      "node_name": "ist000242",
      "transport_address": "10.44.3.8:9300",
      "node_decision": "worse_balance",
      "weight_ranking": 2
    },
    {
      "node_id": "ihHNmtfqTGSugWowHxCILw",
      "node_name": "ist000243",
      "transport_address": "10.44.2.206:9300",
      "node_decision": "worse_balance",
      "weight_ranking": 3
    }
  ]
}

It looks like this shard has a correctly allocated primary but is perhaps missing a replica? If so, you need to call the allocation explain API with "primary": false to find out about the shards other than the primary.
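
In this case that would be something like:

```
GET /_cluster/allocation/explain
{
  "index": "syslog-rsyslog-2019-02-25",
  "shard": 3,
  "primary": false
}
```

The deciders in that response should tell you why the replica copy cannot be assigned.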

David, much obliged for your help. I was able to fix my cluster by doing the following steps:

  1. Set replicas in the bad indices to 0
  2. PUT _cluster/settings
     {
       "persistent": {
         "cluster.routing.allocation.enable": "none"
       }
     }
  3. POST _flush/synced
  4. Stop ES on the master and data nodes
  5. Start ES on the master nodes first
  6. Start ES on the data nodes second
  7. Set "cluster.routing.allocation.enable" back to "all"
  8. Set replicas in the bad indices back to 1
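
For anyone finding this later, steps 1, 2, 7, and 8 can be done with requests along these lines (the index name here is just one of the ones from this thread):

```
PUT syslog-rsyslog-2019-02-25/_settings
{ "index": { "number_of_replicas": 0 } }

PUT _cluster/settings
{ "persistent": { "cluster.routing.allocation.enable": "none" } }

POST _flush/synced

PUT _cluster/settings
{ "persistent": { "cluster.routing.allocation.enable": "all" } }

PUT syslog-rsyslog-2019-02-25/_settings
{ "index": { "number_of_replicas": 1 } }
```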

I'm back to green!
