Elastic Cluster Red - How to fix!

(Athanasios Antonopoulos) #1

Hi all,

I have an Elasticsearch cluster with 6 nodes on 6 different hosts.
The cluster status is red. All nodes are running Elasticsearch 7.0.1, which is currently the latest version.

GET _cluster/health
{
  "cluster_name" : "xh-elastic-cluster-1",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 6,
  "number_of_data_nodes" : 6,
  "active_primary_shards" : 547,
  "active_shards" : 1028,
  "relocating_shards" : 0,
  "initializing_shards" : 3,
  "unassigned_shards" : 67,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 93.6247723132969
}
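
A note on reading this output: the active_shards_percent_as_number value appears to be the active shards as a share of all active, initializing, and unassigned shards. Here is a minimal Python sketch of that arithmetic using the counts above. The function name is made up for illustration, and the formula is inferred from the reported numbers, not taken from the Elasticsearch source.

```python
def active_shards_percent(health: dict) -> float:
    """Percentage of shard copies that are active (formula inferred, see note above)."""
    active = health["active_shards"]
    # Assumption: relocating shards are already counted in active_shards,
    # so only initializing and unassigned shards are added to the total.
    total = active + health["initializing_shards"] + health["unassigned_shards"]
    return 100.0 if total == 0 else 100.0 * active / total


# Counts copied from the GET _cluster/health output above.
health = {
    "active_shards": 1028,
    "initializing_shards": 3,
    "unassigned_shards": 67,
}

print(round(active_shards_percent(health), 4))  # 93.6248
```

So roughly 6.4% of the shard copies (70 of 1098) are not active, and since the status is red, at least one of those is a primary.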

In Monitoring, I also see:
Unassigned Shards: 16

What do I have to do to fix the cluster status?

Best Regards,
Thanos

(David Turner) #2

Here's a blog post that should help you work out what's going wrong:

(Athanasios Antonopoulos) #3

Thank you very much for the blog post. I ran GET /_cluster/allocation/explain and got:

{
  "index" : "metricbeat-6.7.0-2019.05.13",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "MANUAL_ALLOCATION",
    "at" : "2019-05-13T08:14:06.344Z",
    "details" : "failed shard on node [wIwQPrBRRuKRxR9_OtJxhw]: shard failure, reason [failed to recover from translog], failure EngineException[failed to recover from translog]; nested: IllegalIndexShardStateException[CurrentState[CLOSED] operation only allowed when recovering, origin [LOCAL_TRANSLOG_RECOVERY]]; ",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because all found copies of the shard are either stale or corrupt"......

What can I do in such a case?

(David Turner) #4

Can you share the whole output?

(Athanasios Antonopoulos) #5

Hi,

Yesterday I found the indices with a doc count of 0 and deleted them. After that, the cluster status became yellow.

This morning it was red again, and I had to delete an index with red status. After that, the cluster status is:

GET _cluster/health
{
  "cluster_name" : "xh-elastic-cluster-1",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 6,
  "number_of_data_nodes" : 6,
  "active_primary_shards" : 520,
  "active_shards" : 995,
  "relocating_shards" : 2,
  "initializing_shards" : 0,
  "unassigned_shards" : 45,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 95.67307692307693
}

GET /_cluster/allocation/explain
{
  "index" : "winlogbeat-6.7.0-2019.05.03",
  "shard" : 1,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "NODE_LEFT",
    "at" : "2019-05-16T07:16:59.621Z",
    "details" : "node_left [6q5asfwjQ_eoI3xkl2-JXg]",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "throttled",
  "allocate_explanation" : "allocation temporarily throttled",
  "node_allocation_decisions" : [
    {
      "node_id" : "14b6D9VCR36schkKD3k74A",
      "node_name" : "xh-fr-elastic-2",
      "transport_address" : "135.238.239.132:9300",
      "node_attributes" : {
        "ml.machine_memory" : "269930725376",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true"
      },
      "node_decision" : "throttled",
      "deciders" : [
        {
          "decider" : "throttling",
          "decision" : "THROTTLE",
          "explanation" : "reached the limit of outgoing shard recoveries [2] on the node [isTX9Dk7SMSaP3GARPtU9A] which holds the primary, cluster setting [cluster.routing.allocation.node_concurrent_outgoing_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
        }
      ]
    },
    {
      "node_id" : "6q5asfwjQ_eoI3xkl2-JXg",
      "node_name" : "xh-gr-elastic-1",
      "transport_address" : "10.158.67.175:9300",
      "node_attributes" : {
        "ml.machine_memory" : "16654872576",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true"
      },
      "node_decision" : "throttled",
      "deciders" : [
        {
          "decider" : "throttling",
          "decision" : "THROTTLE",
          "explanation" : "reached the limit of outgoing shard recoveries [2] on the node [isTX9Dk7SMSaP3GARPtU9A] which holds the primary, cluster setting [cluster.routing.allocation.node_concurrent_outgoing_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
        }
      ]
    },
    {
      "node_id" : "OETgEHqTR9Ku30WwPWADyg",
      "node_name" : "xh-gr-elastic-2",
      "transport_address" : "10.159.166.9:9300",
      "node_attributes" : {
        "ml.machine_memory" : "269932404736",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true"
      },
      "node_decision" : "throttled",
      "deciders" : [
        {
          "decider" : "throttling",
          "decision" : "THROTTLE",
          "explanation" : "reached the limit of outgoing shard recoveries [2] on the node [isTX9Dk7SMSaP3GARPtU9A] which holds the primary, cluster setting [cluster.routing.allocation.node_concurrent_outgoing_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
        }
      ]
    },
    {
      "node_id" : "_BIH8swLQz6rJa_ImpYJuA",
      "node_name" : "xh-it-elastic-2",
      "transport_address" : "151.98.17.34:9300",
      "node_attributes" : {
        "ml.machine_memory" : "8186564608",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true"
      },
      "node_decision" : "throttled",
      "deciders" : [
        {
          "decider" : "throttling",
          "decision" : "THROTTLE",
          "explanation" : "reached the limit of incoming shard recoveries [2], cluster setting [cluster.routing.allocation.node_concurrent_incoming_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
        }
      ]
    },
    {
      "node_id" : "uHTHPU56QfmRVfP29OsS0Q",
      "node_name" : "xh-it-elastic-1",
      "transport_address" : "151.98.17.60:9300",
      "node_attributes" : {
        "ml.machine_memory" : "9223372036854771712",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true"
      },
      "node_decision" : "throttled",
      "deciders" : [
        {
          "decider" : "throttling",
          "decision" : "THROTTLE",
          "explanation" : "reached the limit of outgoing shard recoveries [2] on the node [isTX9Dk7SMSaP3GARPtU9A] which holds the primary, cluster setting [cluster.routing.allocation.node_concurrent_outgoing_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
        }
      ]
    },
    {
      "node_id" : "isTX9Dk7SMSaP3GARPtU9A",
      "node_name" : "xh-fr-elastic-1",
      "transport_address" : "135.238.239.48:9300",
      "node_attributes" : {
        "ml.machine_memory" : "16654970880",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[winlogbeat-6.7.0-2019.05.03][1], node[isTX9Dk7SMSaP3GARPtU9A], [P], s[STARTED], a[id=N2QSmunAQEGqKED_jnKKBA]]"
        },
        {
          "decider" : "throttling",
          "decision" : "THROTTLE",
          "explanation" : "reached the limit of outgoing shard recoveries [2] on the node [isTX9Dk7SMSaP3GARPtU9A] which holds the primary, cluster setting [cluster.routing.allocation.node_concurrent_outgoing_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
        }
      ]
    }
  ]
}
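
An allocation explain response of this size is easier to read once the per-node deciders are tallied. A rough Python sketch of that tally follows. The helper name is hypothetical, and the sample is cut down to two of the six nodes above with the long explanation strings omitted.

```python
from collections import Counter


def summarize_decisions(explain: dict) -> Counter:
    """Count (decider, decision) pairs across all nodes in an allocation explain response."""
    counts = Counter()
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            counts[(decider["decider"], decider["decision"])] += 1
    return counts


# Cut-down sample: two of the six nodes from the output above.
sample = {
    "index": "winlogbeat-6.7.0-2019.05.03",
    "shard": 1,
    "primary": False,
    "node_allocation_decisions": [
        {
            "node_name": "xh-fr-elastic-2",
            "node_decision": "throttled",
            "deciders": [{"decider": "throttling", "decision": "THROTTLE"}],
        },
        {
            "node_name": "xh-fr-elastic-1",
            "node_decision": "no",
            "deciders": [
                {"decider": "same_shard", "decision": "NO"},
                {"decider": "throttling", "decision": "THROTTLE"},
            ],
        },
    ],
}

for (decider, decision), count in summarize_decisions(sample).items():
    print(f"{decider}: {decision} x{count}")
```

Run against the full output, the tally would show every node throttled by the recovery limits and a single same_shard NO on the node that already holds the primary. That fits a replica waiting its turn to recover, not a lost primary.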

(David Turner) #6

This doesn't tell us anything about why your cluster health was red.
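
One way to narrow down which indices turn the cluster red is the per-index view of the health API, GET _cluster/health?level=indices, which reports a status for each index. Below is a hedged Python sketch of filtering such a response. The helper name is hypothetical and the sample index names and counts are invented for illustration; they are not taken from this cluster.

```python
def red_indices(health: dict) -> list:
    """Return the sorted names of indices whose status is red in a level=indices health response."""
    indices = health.get("indices", {})
    return sorted(name for name, info in indices.items() if info.get("status") == "red")


# Invented sample shaped like a GET _cluster/health?level=indices response.
sample_health = {
    "status": "red",
    "indices": {
        "example-index-a": {"status": "green", "unassigned_shards": 0},
        "example-index-b": {"status": "red", "unassigned_shards": 2},
        "example-index-c": {"status": "yellow", "unassigned_shards": 1},
    },
}

print(red_indices(sample_health))  # ['example-index-b']
```

Each red index has at least one unassigned primary, so running the allocation explain API against one of those primaries should say more about the cause than the output for a replica.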