Two data nodes: One node left, get stale shards and cluster status goes red

Hello.

I searched for this scenario and only found hints that it can happen in certain situations, but I don't know how to solve it.

We're using ES 6.8.5. We have one master-only node and two nodes with both roles (data and master). All our indices have one replica. When one of the data nodes leaves the cluster (is shut down), the cluster status goes red. As far as I can see, it's because the shards of the active write indices are marked as stale.

GET _cluster/health?pretty

{
  "cluster_name" : "graylog",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 423,
  "active_shards" : 423,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 423,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 50.0
}

ES master Logs:

...
[delete30wg_605][1] marking unavailable shards as stale: [IFVdNVhCRgWhRqGugLQOaQ]
[delete30wg_605][0] marking unavailable shards as stale: [7RSklzg_Twqz2pzpq1yj_Q]
[delete30wg_605][3] marking unavailable shards as stale: [7SfPvq5ySKScNuZBCcbQPQ]
[delete30wg_605][2] marking unavailable shards as stale: [rAlUNv_fQtiOWMOZwbc6sw]
...

GET _cat/shards?pretty

delete30fw_605     3 p STARTED     6031863    1.9gb 10.0.137.13 Fqc3okF
delete30fw_605     3 r UNASSIGNED                               
delete30fw_605     1 p STARTED     6033632      2gb 10.0.137.13 Fqc3okF
delete30fw_605     1 r UNASSIGNED                               
delete30fw_605     2 p STARTED     6040161    1.9gb 10.0.137.13 Fqc3okF
delete30fw_605     2 r UNASSIGNED                               
delete30fw_605     0 p STARTED     6035036    1.9gb 10.0.137.13 Fqc3okF
delete30fw_605     0 r UNASSIGNED                               

GET _cluster/allocation/explain?pretty

{
  "index" : "delete30fw_605",
  "shard" : 0,
  "primary" : true,
  "current_state" : "started",
  "current_node" : {
    "id" : "Fqc3okFAR066rkXY3lSn6Q",
    "name" : "Fqc3okF",
    "transport_address" : "10.0.137.13:9300",
    "attributes" : {
      "ml.machine_memory" : "67197956096",
      "ml.max_open_jobs" : "20",
      "xpack.installed" : "true",
      "ml.enabled" : "true"
    },
    "weight_ranking" : 1
  },
  "can_remain_on_current_node" : "yes",
  "can_rebalance_cluster" : "no",
  "can_rebalance_cluster_decisions" : [
    {
      "decider" : "rebalance_only_when_active",
      "decision" : "NO",
      "explanation" : "rebalancing is not allowed until all replicas in the cluster are active"
    },
    {
      "decider" : "cluster_rebalance",
      "decision" : "NO",
      "explanation" : "the cluster has unassigned shards and cluster setting [cluster.routing.allocation.allow_rebalance] is set to [indices_all_active]"
    }
  ],
  "can_rebalance_to_other_node" : "no",
  "rebalance_explanation" : "rebalancing is not allowed"
}

Should I set cluster.routing.allocation.allow_rebalance to "indices_primaries_active" or "always"? Does rebalancing work with only two data nodes? I would expect the cluster status to go yellow when one of the data nodes fails.
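
If changing that setting is the way to go, I assume it would be a transient cluster settings update along these lines (just a sketch, I'm not sure it's the right fix):

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.allow_rebalance": "indices_primaries_active"
  }
}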

Thanks

The cluster health is red, so there is at least one unassigned primary shard. You need to focus your attention on that. The excerpts shared above show only assigned primaries.
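
For example (just one way to find it), _cat/shards with the unassigned.reason column lists every unassigned shard, and _cluster/allocation/explain without a request body should explain the first unassigned shard it finds:

GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state

GET _cluster/allocation/explain?pretty

Look for rows with "p UNASSIGNED" in the first output.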

Thanks for the hint. You're right: there was a built-in index in Graylog that came without a replica after the last update. Now everything works.
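
For anyone else running into this: I assume a replica could also be added to such an index by hand with an index settings update like the one below (the index name is a placeholder; use the index reported as red):

PUT <index-name>/_settings
{
  "index": {
    "number_of_replicas": 1
  }
}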
