Unexpected and uneven per-index shard allocation

Hello,

We have a 5-node ES 2.3.3 cluster that we use for a write-heavy logging workload. We keep 90 daily indices on it, each with the default settings of {"number_of_shards": 5, "number_of_replicas": 1}. Each index contains about 80M documents and takes up about 70GB.
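For completeness, each daily index effectively ends up with settings equivalent to creating it like this (the index name is just an example of our naming):

PUT /logs-2016.07.12
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}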

When new indices are created, they almost always have an uneven shard allocation. The following allocation is pretty typical for our 5 (P)rimaries and 5 (R)eplicas per index:

Node 1: P0 P1 P2 P3 P4
Node 2: R1
Node 3: R3
Node 4: R2
Node 5: R0 R4

It's worth noting that the exact allocation for Nodes 2-5 will vary a bit, but Node 1 almost always gets all 5 primary shards.
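(For reference, the layout above comes from something like this, with the index name just an example:

GET /_cat/shards/logs-2016.07.12?v

which lists each shard copy, whether it's a primary or replica, and the node it landed on.)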

I understand that we don't need to worry about the uneven distribution of primaries, but we are seeing poor bulk indexing performance with the uneven default allocation. When we manually adjust the allocation to an even 2 shards per node--what we would expect by default--we have no indexing bottlenecks.
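For the record, the manual adjustment is done with cluster reroute move commands along these lines (the index, shard number, and node names are just examples):

POST /_cluster/reroute
{
  "commands": [
    { "move": { "index": "logs-2016.07.12", "shard": 1, "from_node": "node-1", "to_node": "node-2" } }
  ]
}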

I'm wondering what we can do to automatically get an even per-index shard allocation. I've found some hacky approaches like initially applying a restrictive total_shards_per_node when creating an index, then relaxing it later. Setting {"cluster.routing.allocation.balance.index": 1} doesn't seem to affect initial allocations.
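The hacky approach looks roughly like this at creation time, with the restriction relaxed once the index is no longer being written to (again, the index name is an example):

PUT /logs-2016.07.13
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1,
    "index.routing.allocation.total_shards_per_node": 2
  }
}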

(I also tried creating empty indices with 1 replica and 10, 15, and 20 shards. In all cases, I ended up with 5 primaries on Node 1. The overall allocation of the 10-shard index was almost perfectly even, but the 15- and 20-shard indices came out very uneven. I also tried creating a 5-shard, 0-replica index, and all 5 primaries were again on Node 1. I don't think any of this is particularly relevant to solving the current problem, but it does point to some affinity Node 1 has for hosting 5 shards per index!)

Key settings are below. Happy to provide any other settings/details that would be helpful to see.

Thanks!


Cluster-wide /etc/elasticsearch/elasticsearch.yml:

cluster.name: cluster_name
node.name: node_name
path.data: /path/to/elasticsearch/data0,/path/to/elasticsearch/data1
path.logs: /path/to/elasticsearch/log
network.bind_host: "0.0.0.0"
network.publish_host: _non_loopback_
discovery:
  type: ec2
  ec2:
    groups: aws-security-group-name
script.engine.groovy.inline.aggs: on

GET /_cluster/settings:

{
  "persistent": {},
  "transient": {
    "cluster.routing.allocation.enable": "all",
    "cluster.routing.allocation.balance.index": "1",
    "threadpool.bulk.queue_size": "100"
  }
}

Can you set

cluster.routing.rebalance.enable: all
cluster.routing.allocation.allow_rebalance: always

and see if that makes a difference?
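For example, as transient cluster settings:

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.rebalance.enable": "all",
    "cluster.routing.allocation.allow_rebalance": "always"
  }
}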

Also, what is the disk storage capacity on all 5 of your nodes?

total_shards_per_node is a powerful tool, but you have to be careful with it. I'd only apply it to the index I'm currently writing and remove it after the index is done being written.
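One way to do that once writes to the index are finished is to set it back to its unbounded default of -1, roughly like this (the index name is just an example):

PUT /logs-2016.07.13/_settings
{
  "index.routing.allocation.total_shards_per_node": -1
}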

The thing about Elasticsearch's allocation is that it treats every shard as equal. Along those lines, what are the total shard counts on your nodes? Do you have shards with vastly different sizes? That situation does come up, and there aren't many better tools for it than total_shards_per_node.
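A quick way to see both the per-node shard counts and disk usage in one place:

GET /_cat/allocation?v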

@jkuang, cluster.routing.rebalance.enable was already defaulting to all, and I set cluster.routing.allocation.allow_rebalance to always. I got the same uneven allocation when creating a test index with default settings, but it seems like I might need to push some data into it before allow_rebalance becomes relevant.

Disk storage capacity by node:

  • Node 1: 2TB, 82% used
  • Node 2: 2TB, 75% used
  • Node 3: 2TB, 74% used
  • Node 4: 4TB, 40% used
  • Node 5: 4TB, 39% used

@nik9000, completely agreed on the risks of total_shards_per_node! I was just relaxing it after writing, but I think you're right that removing it entirely is the safer move.

You're on to something with the total shard counts across nodes:

  • Node 1: 359
  • Node 2: 555
  • Node 3: 554
  • Node 4: 554
  • Node 5: 555

The shard totals here are much larger than the 90 * 10 = 900 my original post suggests, because there are also about 150 much smaller indices (a cumulative data total of ~50GB, but each with 5 shards and 1 replica). If Elasticsearch sees all shards as equal, this explains why so many new shards land on Node 1: it has by far the fewest shards.

To help equalize shard counts and sizes as much as possible, it seems I should limit the small indices to a single shard. What can I do within an index, though, to keep each shard approximately the same size, or is that less critical?

Usually each shard in an index is going to be about the same size. Unless you are using _routing or _parent. Usually.

For small indices I'd indeed limit them to a single shard.

Are these time based indices? If so you should be able to make the change to their template to change the number of shards and slowly the cluster should even out. You could also manually _reindex the older ones but you'd have to deal with having two copies of the data for a while (one in the old index, one in the new). But you'd be able to work on that actively and that ought to help with the allocation. If you did that I'd prioritize indices that have many shards on nodes 4 and 5.
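The template change is roughly along these lines (the template name and pattern are just examples; in 2.x the match goes in the template field):

PUT /_template/small-logs
{
  "template": "small-*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}

And a manual reindex of an older index would look something like this, with the destination index created (or template-matched) as a single shard beforehand:

POST /_reindex
{
  "source": { "index": "small-2016.05.01" },
  "dest": { "index": "small-2016.05.01-1shard" }
}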

@nik9000, all the shards in an index at least have equivalent document counts even when their on-disk sizes differ, so that makes sense.

The majority of the small indices are time-based, so I've updated their template settings to limit them to a single shard, and they'll eventually cause less of an imbalance. There might be a few to _reindex, but for the most part I think I'll just wait it out.

Until the per-node shard counts reach a natural equilibrium, I'll continue to use total_shards_per_node to avoid bottlenecks during bulk indexing.

Thanks for your help here--feeling much more confident in my understanding of the allocation dynamics!

You're quite welcome! Have a good night/whatever time it is for you now.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.