ES Unassigned shards not assigning

We were recently moving an Elasticsearch cluster to a new environment. I'm not certain why, but at some point half of the nodes in the old environment shut off. This caused a bunch of shards to go unassigned, which I didn't initially think was a problem. However, I'd like the current shards to just distribute themselves among the REMAINING nodes, as we are decommissioning the old ones.

I have been trying the cluster allocation and rebalance settings, and for some reason the shards STILL won't assign. Here is what my cluster health and cluster settings look like right now:

{
    "cluster_name" : "non-prod-management",
    "status" : "red",
    "timed_out" : false,
    "number_of_nodes" : 8,
    "number_of_data_nodes" : 5,
    "active_primary_shards" : 4772,
    "active_shards" : 9612,
    "relocating_shards" : 2,
    "initializing_shards" : 0,
    "unassigned_shards" : 2994,
    "delayed_unassigned_shards" : 0,
    "number_of_pending_tasks" : 0,
    "number_of_in_flight_fetch" : 0,
    "task_max_waiting_in_queue_millis" : 0,
    "active_shards_percent_as_number" : 76.24940504521656
}

{
    "persistent" : {
      "cluster" : {
        "routing" : {
          "allocation" : {
            "enable" : "all"
          }
        }
      }
    },
    "transient" : {
      "cluster" : {
        "routing" : {
          "rebalance" : {
            "enable" : "all"
          },
          "allocation" : {
            "enable" : "all",
            "allow_rebalance" : "always"
          }
        }
      }
    }
}
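
For reference, those transient values were set through the cluster settings API, roughly like this (newer ES versions also want a Content-Type: application/json header on the request):

curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient" : {
    "cluster.routing.allocation.enable" : "all",
    "cluster.routing.rebalance.enable" : "all",
    "cluster.routing.allocation.allow_rebalance" : "always"
  }
}'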

I am considering using the reroute API in a loop across the unassigned shards, but only really as a last resort. Are there any other ways I can possibly get these to re-allocate? I cannot figure out why they seem to be stuck.
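
For the record, the kind of reroute call I'd be looping over would look roughly like this. The index, shard number, and node name below are placeholders; on ES 5.x and later the command is allocate_replica / allocate_empty_primary rather than plain allocate, and allow_primary can lose data if the primary's copy really is gone:

curl -XPOST 'http://localhost:9200/_cluster/reroute' -d '{
  "commands" : [
    {
      "allocate" : {
        "index" : "some-unassigned-index",
        "shard" : 0,
        "node" : "remaining-node-name",
        "allow_primary" : true
      }
    }
  ]
}'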

Any help appreciated... Thank you!!

Your cluster is RED because some of the unassigned shards are primaries; the 2 relocating shards may be primaries that are in the process of being assigned. Are these large shards?

How many indices do you have and what is your number of replicas setting for each?

Lastly, that's a lot of shards to distribute over just 5 nodes. Is it possible your nodes are dying from out of memory errors or something similar causing the cluster chaos?
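
Also, to see exactly which shards are stuck (and which indices they belong to), something like this should do it:

curl -s -XGET 'http://localhost:9200/_cat/shards' | grep UNASSIGNED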

We have 3282 indices, SOME of which range up to 70 GB in size, but the vast majority are under a gig. It looks like 200 of them are over 1 GB, and only 28 of those are over 10 GB. We have 5 shards per index with 2 replicas. The servers are not MASSIVE, but when I check the memory and CPU on the machines it seems reasonable, around 1.5 GB of memory free, so I don't think that is the problem...
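
For reference, per-index sizes and shard counts like these are easy to pull from the cat API, e.g.:

curl -XGET 'http://localhost:9200/_cat/indices?v&h=index,pri,rep,store.size'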

Oh, it seems that number was on the master, not the slaves; one of those is hanging at 223 MB of free memory... I am going to try adding another node to see if that helps alleviate the load.

Yeah, in general that's a pretty high number of shards for just 5 data nodes; prefer fewer shards with more data per shard (up to a limit). Is your one node with 223 MB free memory doing a lot of GC?

I am not sure; how can I check that?

curl -XGET 'http://localhost:9200/_nodes/stats'

There is a JVM section in there with GC stats (as well as other JVM-related stats).
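
If you only care about the JVM numbers, you can also hit the filtered version of that endpoint:

curl -XGET 'http://localhost:9200/_nodes/stats/jvm?pretty'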

On just one of the slaves, I see this:

"gc" : {
      "collectors" : {
        "young" : {
          "collection_count" : 405843,
          "collection_time_in_millis" : 22200921
        },
        "old" : {
          "collection_count" : 627,
          "collection_time_in_millis" : 75766
        }
      }
}
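
If I'm reading those numbers right, old-gen collections average about 75766 / 627 ≈ 121 ms each, and the young-gen total of 22200921 ms works out to roughly 6.2 hours of GC over however long the node has been up, so nothing there looks obviously pathological on its own.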

Since then I have added two more nodes to the cluster and it has had basically no effect... The free memory on the slaves ranges from 500 MB to 3 GB, so I no longer think it's a memory problem...

Is there info in the logs that might give some clues as to what's happening? Anything around shards not being allocated, rerouting, or recovery?
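
For example, on a standard package install the log file is usually /var/log/elasticsearch/<cluster name>.log (so non-prod-management.log here, though your paths may differ), and something like this would pull out the relevant lines:

grep -iE 'unassigned|allocat|reroute|recover' /var/log/elasticsearch/non-prod-management.log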