Getting disk space error during rolling upgrade

Hello,

I'm currently performing a rolling upgrade from ES 5.6.8 to 6.8.4, following the steps described here.

Everything is fine, but I'm having trouble with certain nodes and disk space. I'll describe the scenario:

Before upgrade:

Node total disk space: 200 GB.
Total shards in this node: 229
Disk used: 71%.

cluster.routing.allocation.disk.watermark.low: 85% (default)
cluster.routing.allocation.disk.watermark.high: 90% (default)
cluster.routing.allocation.disk.watermark.flood_stage: 95% (default)

Upgrade steps:

  • "cluster.routing.allocation.enable": "primaries"
  • stop elasticsearch service.
  • perform upgrade (at this point all 229 shards are unassigned, as expected)
  • start elasticsearch service
  • node joins the cluster
  • "cluster.routing.allocation.enable": null (back to default so shards assign back to the node)
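
For readers following along, the two settings changes in the steps above could be sent to the cluster settings API roughly like this. It is only a minimal sketch: the `http://localhost:9200` address, the helper names and the choice of `persistent` (rather than `transient`) scope are assumptions, not something stated in this thread.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed address of any node in the cluster


def allocation_body(value):
    """Build the _cluster/settings body; value=None resets to the default."""
    return {"persistent": {"cluster.routing.allocation.enable": value}}


def build_request(value, base_url=ES_URL):
    """Prepare (but do not send) the PUT _cluster/settings request."""
    data = json.dumps(allocation_body(value)).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_cluster/settings",
        data=data,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def set_allocation(value, base_url=ES_URL):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(value, base_url)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Before stopping the node: only primaries may be allocated.
    print(json.dumps(allocation_body("primaries")))
    # After the node has rejoined: null restores the default setting.
    print(json.dumps(allocation_body(None)))
```

Calling `set_allocation("primaries")` before stopping the service and `set_allocation(None)` after the node rejoins would mirror the first and last steps of the list.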

The problem:

227 shards were assigned to the node, as expected, but the 2 remaining shards are still unassigned. I then checked _cluster/allocation/explain to troubleshoot the issue and found the following error:

"deciders": [
{
"decider": "disk_threshold",
"decision": "NO",
"explanation": "allocating the shard to this node will bring the node above the high watermark cluster setting [cluster.routing.allocation.disk.watermark.high=90%] and cause it to have less than the minimum required [0b] of free space (free: [49.8gb], estimated shard size: [50gb])"
}
]

Which makes perfect sense, because ES is trying to allocate this shard to a node with no space left. But here is my question: how does this happen? As I stated above, this node already held all these shards with no problem and only had around 70% of its disk used. Why is ES blocking it now if there was no problem before the upgrade? Am I missing something?

Thank you in advance.

Hi Nelson,

You mentioned you have 142 GB of 200 GB used. 85% is 170 GB, so your data nodes will not accept a replica of more than about 28 GB. The error is therefore fully expected here: if all the data nodes are in the same state at that point, a 50 GB shard is unlikely to fit on any of them.
low watermark - a node will not take in a new shard that would take it above that level
high watermark - the node will move shards away, provided there is enough room on another data node for it to stay below the low watermark after adding the shard
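
To make the percentages concrete, here is a small sketch of the arithmetic using the figures from this thread (200 GB disk, 71% used, default watermarks). The function names are illustrative only, and this is a simplified model of the thresholds, not Elasticsearch's actual decider logic.

```python
TOTAL_GB = 200.0
USED_PCT = 71.0
WATERMARKS_PCT = {"low": 85.0, "high": 90.0, "flood_stage": 95.0}


def watermark_gb(total_gb, pct):
    """Disk usage in GB at which a percentage watermark is reached."""
    return total_gb * pct / 100.0


def headroom_gb(total_gb, used_pct, watermark_pct):
    """GB that can still be written before the watermark is reached."""
    return watermark_gb(total_gb, watermark_pct) - watermark_gb(total_gb, used_pct)


for name, pct in WATERMARKS_PCT.items():
    # Print the threshold in GB and the remaining room below it.
    print(name, watermark_gb(TOTAL_GB, pct), headroom_gb(TOTAL_GB, USED_PCT, pct))
```

With these inputs the room left below both the low and the high watermark is smaller than the 50 GB shard, which is the situation the allocation explain output reports.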

So here you need bigger disks on your data nodes so that all replicas can be assigned.

You also mentioned that 229 shards add up to 142 GB, while a single shard is 50 GB (which is not a bad size, it all depends). You are likely to see some imbalance if you mostly have tiny shards and a few large ones.

Hope this helps!

Hello Julien,

Thanks for the fast reply, I really appreciate it.

I understand your point, however, what I really don't get is the following:

I have 142 GB used (as you said), but at this point the upgrade has not started, so all 229 shards (including the big 50 GB one) are currently hosted on this node, inside the 142 GB used.

When I perform the upgrade and the node rejoins the cluster, used disk space should be almost zero because the shards are still unassigned. ES should then allocate these 229 shards back to the node, and used disk space should again be ~142 GB, as expected. However, it's trying to allocate 142 GB + 50 GB, and this is what I really don't get, since those 50 GB are already part of the 142 GB.

Elasticsearch doesn't delete any shard data when a shard is unassigned, since it reuses that data when recovering a replica. However it cannot tell how much of that data can be reused before assigning the replica, so it has to assume it needs enough space to copy the whole shard over again. You don't have enough space to do that, so it's stuck.
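
As a rough illustration of why the check fails, the sketch below redoes the arithmetic from the allocation explain output: the old shard data is still on disk, so only 49.8 GB is reported free, and the full estimated shard size (50 GB) is added on top. This is a simplified model with hypothetical names, assuming a 200 GB disk and treating `gb` in the output as GB. It is not Elasticsearch's actual `disk_threshold` decider code.

```python
def usage_after_shard_pct(total_gb, free_gb, shard_gb):
    """Projected disk usage (percent) if the whole shard is copied again."""
    used_after = (total_gb - free_gb) + shard_gb
    return 100.0 * used_after / total_gb


def can_allocate(total_gb, free_gb, shard_gb, high_watermark_pct=90.0):
    """True if the projected usage stays at or below the high watermark."""
    return usage_after_shard_pct(total_gb, free_gb, shard_gb) <= high_watermark_pct


# Figures from the explain output: free 49.8 GB, estimated shard size 50 GB.
print(can_allocate(200.0, 49.8, 50.0))

# Hypothetical comparison: if the stale 50 GB copy were not counted as used.
print(can_allocate(200.0, 99.8, 50.0))
```

The first call models what the decider sees (the stale copy counted as used space plus a full new copy); the second is only a what-if to show why the node was fine before the upgrade.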

Can you free up some space on this node? For instance, maybe use shard allocation filtering to move some other shards off this node?
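
A minimal sketch of what such a filter could look like, using the index-level `index.routing.allocation.exclude._name` setting. The index name `some-small-index` and node name `node-1` are placeholders, and the helpers only build the request rather than sending it.

```python
def exclude_node_request(index, node_name):
    """Method, path and body that ask ES to move an index off a node."""
    body = {"index.routing.allocation.exclude._name": node_name}
    return "PUT", f"/{index}/_settings", body


def clear_exclusion_request(index):
    """Same request with null, which removes the filter again."""
    body = {"index.routing.allocation.exclude._name": None}
    return "PUT", f"/{index}/_settings", body


# Placeholder names: pick an index whose shards can live elsewhere.
print(exclude_node_request("some-small-index", "node-1"))
print(clear_exclusion_request("some-small-index"))
```

Once enough space has been freed and the large shard has been assigned, the filter can be cleared so the other shards are free to move back.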

If the 50 GB replica had been assigned first, you might have been OK, as the rest of the shards would have been allocated wherever there was room on other nodes. However, per the allocation explain output, if every node has only about 28 GB left when the 50 GB shard looks for a suitable data node, the shard will remain unassigned.

A poor workaround would be to split the index into more, smaller shards. Note, however, that shards have associated costs, and shard size should not be dictated by the issue you have right now; use capacity planning to decide on shard size instead.
Also, for future reference, read up on priority to understand how Elasticsearch decides which shard to recover/allocate first (again, not a great workaround, but it can help make sure the biggest shard gets assigned first).
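
For reference, a small sketch of the priority idea. The `index.priority` setting is real; the ordering function follows the documented rule (higher priority first, then newer creation date, then higher index name), while the index names and values are made up for illustration. Priority influences recovery order only and does not by itself guarantee that a shard will fit.

```python
def priority_body(priority):
    """Body for PUT /<index>/_settings that sets the recovery priority."""
    return {"index.priority": priority}


def recovery_order(indices):
    """Sort index descriptions by priority, creation date, then name,
    all descending, as a simplified model of the documented ordering."""
    return sorted(
        indices,
        key=lambda i: (i["priority"], i["created"], i["name"]),
        reverse=True,
    )


indices = [
    {"name": "small-logs", "priority": 1, "created": 20},  # default priority
    {"name": "big-index", "priority": 10, "created": 10},  # raised priority
]
print(priority_body(10))
print([i["name"] for i in recovery_order(indices)])
```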

You should normally plan enough disk space on the data nodes so that if one data node fails, the other data nodes can comfortably store the extra load. Deleting data might also help if disk usage has grown too high, but 200 GB for a data node really sounds very small here.

Hey, thank you both for your replies.

I forgot to mention that this is a "test" migration we are performing on a sandbox cluster, which receives production traffic but only for testing purposes. That's why a single node has as little as 200 GB of disk space. However, we wanted to see what this process would be like, since the rolling upgrade procedure is exactly the same on the production clusters, even though those nodes have significantly more disk space (around 2 TB).

What do we want to achieve?

Since we need to perform this rolling upgrade on all our production clusters, all we want is a sequential upgrade, node by node. This means: stop Elasticsearch, perform the upgrade, start Elasticsearch, and assign all the unassigned shards back to the same node. We want to avoid, at all costs, ES reallocating these shards to other nodes, because that would mean additional read/write IOPS and CPU usage, which could lead to cluster performance issues.

@Julien: I'll take a look at these links and see if they can help us.