Unassigned shards -- and two shards rebalancing

My cluster has been stuck in Yellow for a couple of days. There are two shards that have been rebalancing for days (???), and this appears to be causing the cluster to refuse to allocate replicas of the primary shards of newly created indexes, leading to unassigned shards.
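For context, the Yellow status and the counts of relocating and unassigned shards come from the cluster health API -- something along these lines, where host and port are just the local defaults:

# reports status plus the number of relocating and unassigned shards
curl -s 'http://localhost:9200/_cluster/health?pretty'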

I have used the _cluster/allocation/explain API on the unassigned shards and get:

{
  "index" : "arkime_history_v1-23w44",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "REPLICA_ADDED",
    "at" : "2023-11-05T19:59:51.887Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "throttled",
  "allocate_explanation" : "allocation temporarily throttled",
  "node_allocation_decisions" : [
    {
      "node_id" : "6UDagJW2T3eWM-0PQJ0rMA",
      "node_name" : "secesprd02",
      "transport_address" : "10.6.0.68:9300",
      "node_attributes" : {
        "xpack.installed" : "true",
        "transform.node" : "false",
        "molochtype" : "hot"
      },
      "node_decision" : "throttled",
      "deciders" : [
        {
          "decider" : "throttling",
          "decision" : "THROTTLE",
          "explanation" : "reached the limit of outgoing shard recoveries [2] on the node [DsJqLibJQSi9D2lIAUHOrw] which holds the primary, cluster setting [cluster.routing.allocation.node_concurrent_outgoing_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
        }
      ]
    },
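
For reference, the explain request was roughly the following, pointed at one of the unassigned replica shards; host and credentials are whatever your cluster uses, and the index and shard come from the output above:

# ask the allocation explain API about a specific unassigned replica
curl -s -H 'Content-Type: application/json' 'http://localhost:9200/_cluster/allocation/explain?pretty' -d '
{
  "index": "arkime_history_v1-23w44",
  "shard": 0,
  "primary": false
}'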

I have no idea why the reallocation process is stuck or how to find out.
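Is the recovery API (and the recovery throttle settings that the decider mentions) the right place to dig into this? I.e. something like:

# list only the recoveries that are currently running, with their stage and progress
curl -s 'http://localhost:9200/_cat/recovery?v&active_only=true'

# show the effective recovery/rebalance settings, defaults included
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' | grep -i recover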

Here is the display from Cerebro of the affected shards:

As always, help greatly appreciated!

I ended up deleting the two indexes that were stuck replicating; the two indexes which had been blocked from allocating their replicas then got initialised, and a few minutes later the cluster turned green. I restored the index that I really cared about from a snapshot and all seems fine now (rough commands below).

  • The other index was about to be deleted soon anyway, so I did not bother restoring it.
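
For the record, the cleanup amounted to roughly the following; the index, repository and snapshot names are placeholders:

# delete the two indexes whose shard relocations were stuck
curl -s -X DELETE 'http://localhost:9200/<stuck-index-1>'
curl -s -X DELETE 'http://localhost:9200/<stuck-index-2>'

# restore the one I cared about from a snapshot
curl -s -X POST -H 'Content-Type: application/json' 'http://localhost:9200/_snapshot/<repo>/<snapshot>/_restore?pretty' -d '
{
  "indices": "<stuck-index-1>"
}'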

After a few days the situation has repeated -- with different shards. There are now two shards from older indices replicating and one new index with only its primary shards allocated.

Is there some way I can find out why the rebalancing operations never complete?
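
In case it is useful, this is a quick way to list every shard that is not in the STARTED state (i.e. the relocating, initialising and unassigned ones):

curl -s 'http://localhost:9200/_cat/shards?v' | grep -v STARTED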

What is the full output of the cluster stats API?
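That is, something along the lines of:

curl -s 'http://localhost:9200/_cluster/stats?human&pretty'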

Funny you should ask that -- I had already looked but could not see anything relevant:

{
  "_nodes" : {
    "total" : 7,
    "successful" : 7,
    "failed" : 0
  },
  "cluster_name" : "security",
  "cluster_uuid" : "CjDhDttPTLafKvn-MYx3vA",
  "timestamp" : 1699899977731,
  "status" : "yellow",
  "indices" : {
    "count" : 410,
    "shards" : {
      "total" : 907,
      "primaries" : 457,
      "replication" : 0.9846827133479212,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 6,
          "avg" : 2.2121951219512197
        },
        "primaries" : {
          "min" : 1,
          "max" : 3,
          "avg" : 1.1146341463414635
        },
        "replication" : {
          "min" : 0.0,
          "max" : 2.0,
          "avg" : 0.9951219512195122
        }
      }
    },
    "docs" : {
      "count" : 17516536185,
      "deleted" : 386200
    },
    "store" : {
      "size_in_bytes" : 2834357442204,
      "total_data_set_size_in_bytes" : 2834357442204,
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size_in_bytes" : 7096904,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 2774750642,
      "total_count" : 120142267,
      "hit_count" : 7625785,
      "miss_count" : 112516482,
      "cache_size" : 138936,
      "cache_count" : 311204,
      "evictions" : 172268
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 3989,
      "memory_in_bytes" : 77227488,
      "terms_memory_in_bytes" : 60893728,
      "stored_fields_memory_in_bytes" : 6901896,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 473408,
      "points_memory_in_bytes" : 0,
      "doc_values_memory_in_bytes" : 8958456,
      "index_writer_memory_in_bytes" : 885383432,
      "version_map_memory_in_bytes" : 4141722,
      "fixed_bit_set_memory_in_bytes" : 52857176,
      "max_unsafe_auto_id_timestamp" : 1699857643067,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "alias",
          "count" : 15,
          "index_count" : 15,
          "script_count" : 0
        },
        {
          "name" : "binary",
          "count" : 9,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "boolean",
          "count" : 542,
          "index_count" : 34,
          "script_count" : 0
        },
        {
          "name" : "byte",
          "count" : 36,
          "index_count" : 36,
          "script_count" : 0
        },
        {
          "name" : "constant_keyword",
          "count" : 195,
          "index_count" : 65,
          "script_count" : 0
        },
        {
          "name" : "date",
          "count" : 1494,
          "index_count" : 332,
          "script_count" : 0
        },
        {
          "name" : "flattened",
          "count" : 189,
          "index_count" : 16,
          "script_count" : 0
        },
        {
          "name" : "float",
          "count" : 151,
          "index_count" : 44,
          "script_count" : 0
        },
        {
          "name" : "geo_point",
          "count" : 245,
          "index_count" : 100,
          "script_count" : 0
        },
        {
          "name" : "integer",
          "count" : 217,
          "index_count" : 120,
          "script_count" : 0
        },
        {
          "name" : "ip",
          "count" : 658,
          "index_count" : 248,
          "script_count" : 0
        },
        {
          "name" : "keyword",
          "count" : 28476,
          "index_count" : 336,
          "script_count" : 0
        },
        {
          "name" : "long",
          "count" : 3394,
          "index_count" : 97,
          "script_count" : 0
        },
        {
          "name" : "match_only_text",
          "count" : 975,
          "index_count" : 15,
          "script_count" : 0
        },
        {
          "name" : "nested",
          "count" : 239,
          "index_count" : 53,
          "script_count" : 0
        },
        {
          "name" : "object",
          "count" : 5710,
          "index_count" : 149,
          "script_count" : 0
        },
        {
          "name" : "scaled_float",
          "count" : 20,
          "index_count" : 20,
          "script_count" : 0
        },
        {
          "name" : "short",
          "count" : 22,
          "index_count" : 22,
          "script_count" : 0
        },
        {
          "name" : "text",
          "count" : 1328,
          "index_count" : 333,
          "script_count" : 0
        },
        {
          "name" : "version",
          "count" : 16,
          "index_count" : 16,
          "script_count" : 0
        },
        {
          "name" : "wildcard",
          "count" : 255,
          "index_count" : 15,
          "script_count" : 0
        }
      ],
      "runtime_field_types" : [ ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [ ],
      "analyzer_types" : [
        {
          "name" : "custom",
          "count" : 5,
          "index_count" : 5
        }
      ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [
        {
          "name" : "pattern",
          "count" : 5,
          "index_count" : 5
        }
      ],
      "built_in_filters" : [
        {
          "name" : "lowercase",
          "count" : 5,
          "index_count" : 5
        }
      ],
      "built_in_analyzers" : [
        {
          "name" : "simple",
          "count" : 23,
          "index_count" : 23
        }
      ]
    },
    "versions" : [
      {
        "version" : "7.10.0",
        "index_count" : 67,
        "primary_shard_count" : 87,
        "total_primary_bytes" : 29705781489
      },
      {
        "version" : "7.14.0",
        "index_count" : 40,
        "primary_shard_count" : 41,
        "total_primary_bytes" : 42016800747
      },
      {
        "version" : "7.16.2",
        "index_count" : 31,
        "primary_shard_count" : 31,
        "total_primary_bytes" : 7558122862
      },
      {
        "version" : "7.17.1",
        "index_count" : 198,
        "primary_shard_count" : 208,
        "total_primary_bytes" : 723554581330
      },
      {
        "version" : "7.17.12",
        "index_count" : 74,
        "primary_shard_count" : 90,
        "total_primary_bytes" : 732034393709
      }
    ]
  },
  "nodes" : {
    "count" : {
      "total" : 7,
      "coordinating_only" : 0,
      "data" : 3,
      "data_cold" : 4,
      "data_content" : 0,
      "data_frozen" : 0,
      "data_hot" : 3,
      "data_warm" : 5,
      "ingest" : 3,
      "master" : 3,
      "ml" : 0,
      "remote_cluster_client" : 0,
      "transform" : 0,
      "voting_only" : 0
    },
    "versions" : [
      "7.17.12"
    ],
    "os" : {
      "available_processors" : 48,
      "allocated_processors" : 48,
      "names" : [
        {
          "name" : "Linux",
          "count" : 7
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Ubuntu 18.04.6 LTS",
          "count" : 1
        },
        {
          "pretty_name" : "Ubuntu 20.04.4 LTS",
          "count" : 1
        },
        {
          "pretty_name" : "Ubuntu 18.04.5 LTS",
          "count" : 1
        },
        {
          "pretty_name" : "Ubuntu 20.04.6 LTS",
          "count" : 4
        }
      ],
      "architectures" : [
        {
          "arch" : "amd64",
          "count" : 7
        }
      ],
      "mem" : {
        "total_in_bytes" : 261068275712,
        "free_in_bytes" : 9252003840,
        "used_in_bytes" : 251816271872,
        "free_percent" : 4,
        "used_percent" : 96
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 5
      },
      "open_file_descriptors" : {
        "min" : 549,
        "max" : 1871,
        "avg" : 1383
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 2606568793,
      "versions" : [
        {
          "version" : "20.0.2",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "20.0.2+9-78",
          "vm_vendor" : "Oracle Corporation",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 7
        }
      ],
      "mem" : {
        "heap_used_in_bytes" : 66316144488,
        "heap_max_in_bytes" : 176093659136
      },
      "threads" : 685
    },
    "fs" : {
      "total_in_bytes" : 22301996646400,
      "free_in_bytes" : 17115490717696,
      "available_in_bytes" : 15999270211584
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 7
      },
      "http_types" : {
        "security4" : 7
      }
    },
    "discovery_types" : {
      "zen" : 7
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "deb",
        "count" : 7
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 11,
      "processor_stats" : {
        "conditional" : {
          "count" : 329757630,
          "failed" : 1,
          "current" : 0,
          "time_in_millis" : 10298594
        },
        "convert" : {
          "count" : 54807394,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 874436
        },
        "date" : {
          "count" : 109585916,
          "failed" : 109479495,
          "current" : 0,
          "time_in_millis" : 3683666
        },
        "foreach" : {
          "count" : 2995,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 52
        },
        "geoip" : {
          "count" : 109585916,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 4164645
        },
        "gsub" : {
          "count" : 43308,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 915
        },
        "join" : {
          "count" : 14436,
          "failed" : 12260,
          "current" : 0,
          "time_in_millis" : 332
        },
        "kv" : {
          "count" : 28872,
          "failed" : 877,
          "current" : 0,
          "time_in_millis" : 114
        },
        "lowercase" : {
          "count" : 219186268,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 2744081
        },
        "remove" : {
          "count" : 109675178,
          "failed" : 54686537,
          "current" : 0,
          "time_in_millis" : 1194894
        },
        "rename" : {
          "count" : 1973150710,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 1496182
        },
        "script" : {
          "count" : 219171832,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 8601853
        },
        "set" : {
          "count" : 219354049,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 4127164
        },
        "set_security_user" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "split" : {
          "count" : 14436,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 338
        },
        "uppercase" : {
          "count" : 14436,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 316
        },
        "user_agent" : {
          "count" : 71831,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 1596
        }
      }
    }
  }
}

No, I do not see anything unusual or suspicious there either.

Is there anything in the logs that indicates long or slow GC, issues with propagating cluster state, or any other problem? Are the hot nodes, which also seem to act as master nodes, heavily loaded?
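
A quick way to get an overview is something like this, which shows heap, CPU and load per node, and what the busy threads are doing:

curl -s 'http://localhost:9200/_cat/nodes?v&h=name,node.role,heap.percent,cpu,load_1m,load_5m'
curl -s 'http://localhost:9200/_nodes/hot_threads'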

Found the problem.

A few weeks ago I added two new nodes to the cluster and failed to update the firewall on one of the existing nodes, so connections on port 9300 were blocked.
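
In hindsight, a simple check of the transport port from every node to every other node would have caught it straight away; the hostname below is a placeholder:

# run from each cluster node; a blocked firewall shows up as a timeout or refusal
nc -vz <other-node> 9300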

I had looked at the logs for several other cluster members, but not the two new ones (DOH!).

What made it more difficult to diagnose was that the problem did not appear until the first time the cluster tried to move a shard from one of the new nodes -- and that was at least a week later.

Moral of the story: when you have weird things happening on the cluster, you need to check the logs of all cluster members.

Thanks again for your help @Christian_Dahlqvist

