Replica shards sometimes do not get assigned, even though a valid assignment exists

Hey,

I've been having a shard-assignment issue with Elasticsearch 5.4.0 ever since I started using it; the same problem also happened in the previous version we used, 1.7.

We have 6 nodes in the cluster, 3 of them are dedicated Master Nodes and 3 are Data Nodes.

The specs for the nodes are -

Master Node - 14GB RAM (7GB configured for JVM), 4 CPU
Data Node - 14GB RAM (7GB configured for JVM), 4 CPU, 2 disks of 50 GB each (So 6 disks total and 300GB storage for the whole cluster)

Each index we create in the cluster has 3 primary shards and 1 replica of each primary,
and index.routing.allocation.total_shards_per_node is set to 2.
So ideally, for each index, each node should hold 1 primary shard and 1 replica shard, in a "circular" assignment like this -

Node 1 - P1, R3
Node 2 - P2, R1
Node 3 - P3, R2

All indices should be assigned like this, and the cluster should end up in a fully green state.
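For reference, index creation on our side looks roughly like the sketch below (the host and index name are placeholders, and mappings are omitted for brevity):

# Placeholder host and index name; mappings omitted for brevity
curl -XPUT 'http://localhost:9200/index-name' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "routing.allocation.total_shards_per_node": 2
    }
  }
}'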

The problem I see is that Elasticsearch doesn't always assign the shards in the circular way described above; sometimes it picks an assignment that leaves no valid node for the last replica. The assignment it sometimes does is -

Node 1 - P1, __
Node 2 - P2, R3
Node 3 - P3, R2

This leaves no node for replica R1: it can't be on the same node as its primary P1 (Node 1), and the other two nodes already hold 2 shards each, which is the limit.

I can't understand why this only sometimes happens, and why Elasticsearch fails to find the circular allocation that would let it place all the replica shards where they should be.
So we have a yellow cluster due to this unsuccessful assignment of shards.

I'm attaching all the configuration and API output that might be relevant for debugging the issue -

Elastic version (from the / endpoint)

"version": {
        "number": "5.4.0",
        "build_hash": "780f8c4",
        "build_date": "2017-04-28T17:43:27.229Z",
        "build_snapshot": false,
        "lucene_version": "6.5.0"
    }

Relevant problematic index settings

{
  "index-name": {
    "settings": {
      "index": {
        "routing": {
          "allocation": {
            "total_shards_per_node": "2"
          }
        },
        "mapping": {
          "total_fields": {
            "limit": "2000"
          }
        },
        "number_of_shards": "3",
        "unassigned": {
          "node_left": {
            "delayed_timeout": "15m"
          }
        },
        "number_of_replicas": "0"
      }
    }
  }
}

The unassigned shard status from _cat/shards API call
index-name 2 r UNASSIGNED REPLICA_ADDED
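(That line came from a call roughly like the following; the exact column list is my approximation:)

# _cat/shards filtered to the interesting columns
curl -XGET 'http://localhost:9200/_cat/shards/index-name?h=index,shard,prirep,state,unassigned.reason'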

The result of the /_cluster/allocation/explain API for the problematic index and the problematic replica shard that can't be assigned

{
  "index": "index-name",
  "shard": 2,
  "primary": false,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "REPLICA_ADDED",
    "at": "2018-03-12T16:04:37.589Z",
    "last_allocation_status": "no_attempt"
  },
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions": [
    {
      "node_id": "D_0f7PlUQh2xCQVJ04mXkw",
      "node_name": "Node3",
      "transport_address": "IP",
      "node_decision": "no",
      "deciders": [
        {
          "decider": "shards_limit",
          "decision": "NO",
          "explanation": "too many shards [2] allocated to this node for index [index-name], index setting [index.routing.allocation.total_shards_per_node=2]"
        }
      ]
    },
    {
      "node_id": "pzVpw4vtQg-ecAGoYkQzqw",
      "node_name": "Node1",
      "transport_address": "IP1",
      "node_decision": "no",
      "store": {
        "matching_size_in_bytes": 130
      },
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "the shard cannot be allocated to the same node on which a copy of the shard already exists [[index-name][2], node[Node1], [P], s[STARTED], a[id=N9RA7hJ5SquJK1y_vgKo0Q]]"
        }
      ]
    },
    {
      "node_id": "zJBuKkKlTbWZCofdn1NCvg",
      "node_name": "Node2",
      "transport_address": "IP",
      "node_decision": "no",
      "deciders": [
        {
          "decider": "shards_limit",
          "decision": "NO",
          "explanation": "too many shards [2] allocated to this node for index [index-name], index setting [index.routing.allocation.total_shards_per_node=2]"
        }
      ]
    }
  ]
}
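(For completeness, the explain output above was produced with a request roughly like this; the host is a placeholder:)

# Ask the cluster why replica shard 2 of the index is unassigned
curl -XGET 'http://localhost:9200/_cluster/allocation/explain' -H 'Content-Type: application/json' -d '
{
  "index": "index-name",
  "shard": 2,
  "primary": false
}'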

I'll add that if I remove the replica and add it back a few times, Elasticsearch eventually ends up with a successful (circular) assignment.
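For context, the remove-and-re-add cycle is just the index settings API, roughly:

# Drop the replica, wait for the cluster to settle, then add it back
curl -XPUT 'http://localhost:9200/index-name/_settings' -H 'Content-Type: application/json' -d '{"index": {"number_of_replicas": 0}}'
curl -XPUT 'http://localhost:9200/index-name/_settings' -H 'Content-Type: application/json' -d '{"index": {"number_of_replicas": 1}}'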

Let me know if any additional API calls are needed and I'll gladly add those. :)

Thanks in advance for any help!

Are all nodes running exactly the same version of Elasticsearch?

Yes, they are all running the same Elastic version (5.4.0, build hash 780f8c4).

Have you tried increasing total_shards_per_node to 3 and then reducing it back to 2 to force reallocation? You could also use the cluster reroute API to move one of the existing shards.
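Something along these lines, for example (the index, shard number and node names here are only illustrative; take the real ones from your explain output, and make sure the target node doesn't already hold a copy of that shard):

# Move a replica of another shard off one of the full nodes,
# freeing room for the unassigned replica
curl -XPOST 'http://localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '
{
  "commands": [
    {
      "move": {
        "index": "index-name",
        "shard": 0,
        "from_node": "Node2",
        "to_node": "Node1"
      }
    }
  ]
}'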

I haven't tried increasing total_shards_per_node to 3, because I don't want to get the cluster into a state with a lot of concurrent allocations. The last time that happened, the cluster couldn't handle the pressure and we had a long downtime until it recovered.

I will try the cluster reroute API to fix the index I described, and I'll report how it went.
But wouldn't that mean I have to keep doing this manually every so often? Our index creation is ongoing, not a one-time thing...

Thanks for your help :)

This sounds similar to issue 12273 - a known issue related to use of total_shards_per_node. Unfortunately with 3 nodes and 3 shards each with 1 replica you've a 2/3 chance of hitting it. For larger shard counts I think the probability gets closer to 1/3 (I've no idea what happens with more replicas).

What would be another way to balance the cluster's shards without using the total_shards_per_node setting?

Before we started using the total_shards_per_node setting with the described number of shards, the cluster spent a lot of time reallocating shards, and that caused many timeouts when handling requests (both indexing and searches).

This seems unexpected and worthy of further investigation. More precisely, it's unexpected both that there was a lot of reallocation activity, and that such activity caused a lot of timeouts.

Maybe it's the number of indices we are holding in the cluster. We have about 14K shards (from about 2,500 indices) spread over the cluster's 3 Data Nodes. However, 90% of them are rather small, some even less than 1MB. I realize that a shard has overhead regardless of its size, but I don't expect it to affect the assignment of shards.

The reason we have so many indices is that we use an index per customer rather than a single index holding many customers. It is part of our architecture today (but it can be changed if we don't find a better way to work around this).

How would you recommend working around it? Remove the total_shards_per_node limit and let Elasticsearch decide how to balance the shards? Perhaps add more nodes so it can balance more easily?

Another reason we use total_shards_per_node is that at certain times Elasticsearch would decide, seemingly out of the blue, to reallocate shards. With total_shards_per_node in place, it cannot move them around even if it wants to. Without that limit, this could become a problem...

That seems like far too many shards.

This is the puzzling bit. It'll have a reason for doing this. Without further digging into the logs I can only speculate, but one possibility is the following: if one of the nodes disconnected from the cluster for longer than the delayed allocation timeout then all its shards would be allocated across the two remaining nodes; when the failed node reconnected to the cluster then rebalancing would try to even things out again. If you have total_shards_per_node enabled then the unassigned shards will remain unassigned until the third node comes back, putting your data at higher risk, but meaning that no shards can move while the node is unavailable.

You can, of course, disable rebalancing without using total_shards_per_node.
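For example, something like this (whether to use transient or persistent is up to you):

# Disable shard rebalancing cluster-wide; valid values are all, primaries, replicas, none
curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster.routing.rebalance.enable": "none"
  }
}'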

However, I wonder if the large number of shards is causing the problems you're seeing. Reallocating shards when a node disconnects is normal, and rebalancing after a disconnected node reconnects is normal too, but the number of shards involved might mean you're hitting a limit somewhere.

If disconnections and reconnections are the cause of the rebalancing then you need to be careful reducing the number of primaries per index, or increasing the number of nodes in the cluster, as this will mean that total_shards_per_node is no longer enough to prevent allocation or rebalancing when nodes connect or disconnect.

On the other hand, indices with 1 primary (or even 2) won't suffer from the problem described in the OP if you have 3 nodes, even with total_shards_per_node in place.
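As an illustration only, an index template roughly like this would apply such settings to newly created per-customer indices (the template name and pattern are placeholders, not something from your setup):

# Hypothetical template giving new per-customer indices a single primary and one replica
curl -XPUT 'http://localhost:9200/_template/customer-indices' -H 'Content-Type: application/json' -d '
{
  "template": "customer-*",
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    }
  }
}'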
