Yellow Cluster Health from unassigned_shards

Hi,

I have a cluster with yellow health. We have gone through several steps to try to get back to green, but none have worked so far.

GET _cluster/health?pretty=true
{
  "cluster_name": "ElasticSearch",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 7,
  "number_of_data_nodes": 3,
  "active_primary_shards": 1481,
  "active_shards": 2924,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 38,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 98.71708305199189
}

First, I added a third data node so that replica data could be placed on a different node.

Second, I tried removing the replicas:

PUT /_settings
{
    "index" : {
        "number_of_replicas" : 0
    }
}

This turned the cluster green, but I had no data replication, and for some reason most of the shards were on the same nodes (very few were spread out to other nodes).
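To see how shards are actually distributed across nodes, the cat shards API is handy (the index pattern here is just an example):

GET _cat/shards/metricbeat-*?v

The output lists each shard, whether it is a primary (p) or replica (r), its state, and the node it is currently assigned to.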

I then added the replicas back

PUT /_settings
{
    "index" : {
        "number_of_replicas" : 1
    }
}

That took me from 300-400 unassigned shards down to my new 38...
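To narrow down which indices the remaining unassigned shards belong to, the health API can report per index (this is just the standard level parameter, nothing specific to this cluster):

GET _cluster/health?level=indices

Indices reported as yellow or red are the ones with unassigned shards.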

My questions are:

Should a 5-shard index put all of its data onto one node (I thought it would spread them out)?
Can replicas be on the same node as the primary shard data (I believe not, I just want to confirm)?
What is an unassigned shard?
How do you fix an unassigned shard?

Cheers

When you tell Elasticsearch that you want some number of replicas, it tries to find homes for the shards. There are a variety of things that control the logic for where shards go. I'd recommend having a look over https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cluster.html for more information on the allocation/filtering settings.

The short version is that Elasticsearch will avoid placing replica shards on the same node as the primary shard (because otherwise the loss of that single node would mean data loss). The best way to "fix" an unassigned shard is to understand why it's unassigned in the first place. There's an API for that! https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-allocation-explain.html tells you, in both machine- and human-readable forms, why a shard is unassigned. Sometimes the allocation failure is transient and can be fixed with a retry-failed-shards API call: https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-reroute.html#_retry_failed_shards
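For the transient-failure case, the retry call from the reroute API linked above looks like this:

POST /_cluster/reroute?retry_failed=true

This asks Elasticsearch to retry allocation of shards that previously failed to allocate too many times; it won't help if the underlying allocation decider still says no.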

A quick fix is to find which indices have unassigned shards, then close and reopen them. That is how I solved it quickly.
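For reference, closing and reopening an index looks like this (using the metricbeat index from this thread as an example; note that a closed index is unavailable for reads and writes until it is reopened):

POST /metricbeat-2017.10.22/_close
POST /metricbeat-2017.10.22/_open

This forces the shards to be reallocated, but it only helps when the original allocation failure was transient.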

You have too many shards, you should reduce them.

Hi @warkolm, is there a formula for how many shards you should have based on the number of nodes? We have just gone with the default of 5.

Hi @shanec,

Thanks for the reply! Loving the explain API! Would you be able to help confirm what I'm reading?

Picking an index that has unassigned shards, I can see the below in Cerebro:

[screenshot: elasticIssue01]

This means:
The index has 5 shards and is replicated once (for a total of 10 shards).
The primary shards are split:

  • 0, 1, 2 are on X5IDSYq
  • 3, 4 are on aiThJE2

The replica shards are split:

  • 0, 1, 2 are unassigned
  • 3, 4 are on 5y5eCsi

I run:

GET /_cluster/allocation/explain
{
  "index": "metricbeat-2017.10.22",
  "shard": 0,
  "primary": true
}

I get a return of:

{
  "index": "metricbeat-2017.10.22",
  "shard": 0,
  "primary": true,
  "current_state": "started",
  "current_node": {
    "id": "X5IDSYqhTtGN6usCsvh1xg",
    "name": "X5IDSYq",
    "transport_address": "10.17.0.33:9300",
    "weight_ranking": 1
  },
  "can_remain_on_current_node": "yes",
  "can_rebalance_cluster": "no",
  "can_rebalance_cluster_decisions": [
    {
      "decider": "rebalance_only_when_active",
      "decision": "NO",
      "explanation": "rebalancing is not allowed until all replicas in the cluster are active"
    },
    {
      "decider": "cluster_rebalance",
      "decision": "NO",
      "explanation": "the cluster has unassigned shards and cluster setting [cluster.routing.allocation.allow_rebalance] is set to [indices_all_active]"
    }
  ],
  "can_rebalance_to_other_node": "no",
  "rebalance_explanation": "rebalancing is not allowed",
  "node_allocation_decisions": [
    {
      "node_id": "5y5eCsinSMuId90SY2KTnw",
      "node_name": "5y5eCsi",
      "transport_address": "10.17.0.66:9300",
      "node_decision": "no",
      "weight_ranking": 2,
      "deciders": [
        {
          "decider": "node_version",
          "decision": "NO",
          "explanation": "target node version [5.6.0] is older than the source node version [5.6.3]"
        }
      ]
    },
    {
      "node_id": "aiThJE2lRUC7XktAkCz7hQ",
      "node_name": "aiThJE2",
      "transport_address": "10.17.0.73:9300",
      "node_decision": "no",
      "weight_ranking": 3,
      "deciders": [
        {
          "decider": "node_version",
          "decision": "NO",
          "explanation": "target node version [5.6.0] is older than the source node version [5.6.3]"
        }
      ]
    }
  ]
}

I was going to start from the top, but the last line is the one that got my attention!

"explanation": "target node version [5.6.0] is older than the source node version [5.6.3]"

Which means the new data node I added is on a different version, which I can understand would be a problem. But why is it that 90% of the indices are fine with it and some are not?

Shards will not be replicated from a node on a more recent version to a node on an older version.
https://www.elastic.co/guide/en/elasticsearch/reference/current/rolling-upgrades.html

explains why the node on the newer version is not receiving replicas.
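You can confirm which version each node is running before (and during) the rolling upgrade; the h parameter just selects the columns shown:

GET _cat/nodes?v&h=name,version,ip

Any mismatch in the version column identifies nodes that still need upgrading before replicas can be allocated everywhere.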

I have now upgraded all the other nodes following the rolling upgrade process here: https://www.elastic.co/guide/en/elasticsearch/reference/current/rolling-upgrades.html

and everything has turned a pretty shade of Green!

Thanks for taking the time to answer my questions and help!

S

No problem! There is one last question you asked which we haven't answered yet:

There's no "formula", but you should generally try to avoid having more shards than you need. @warkolm has been chasing down such cases on the forums and elsewhere, and he's almost certainly right that you're oversharded. We have actually changed the default Beats shard counts (to 1 in the case of Metricbeat) for 6.0 (https://github.com/elastic/beats/issues/5095) because oversharding is so common.

You definitely shouldn't have more shards than nodes: you won't gain anything from that, and it will slow things down, eat more file handles, increase the cluster state, and cause other nasties. Querying generally gets faster with lower shard counts; the flip side is when your cluster can't keep up with the indexing rate of one shard (higher shard counts on higher node counts can increase indexing throughput). If you're so inclined, you can use our benchmarking tool, Rally, to benchmark what different shard counts look like for you. Or you can just set it to 1, which is probably better here.
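One sketch of how to change the default for future Metricbeat indices is an index template (the template name is illustrative, and this uses the 5.x-era "template" field; existing indices are not affected and would need to be reindexed or shrunk):

PUT _template/metricbeat-1-shard
{
  "template": "metricbeat-*",
  "settings": {
    "index.number_of_shards": 1
  }
}

New daily metricbeat indices matching the pattern would then be created with a single primary shard.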

Oversharding is such a common problem that I wrote a blog post providing some guidance, which may be worthwhile reading.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.