Help me understand shard allocation

I have 8 physical nodes in a hot/warm configuration.

Hot (highlighted yellow below): 2 nodes, 800GB NVMe storage
Warm (highlighted red / not highlighted): 6 nodes, 5.5TB spinning disk

Both hot and warm nodes are divided into two 'racks' (they aren't actually in different racks; it's just so we can perform faster rolling restarts of the entire cluster). They have attributes like this:

      "attributes" : {
    "rack_id" : "rack_1",
    "xpack.installed" : "true",
    "box_type" : "warm"
  },
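(For anyone following along: the node.attr.rack_id and node.attr.box_type attributes are set per node in each elasticsearch.yml, and rack awareness is then switched on with a single cluster setting. A minimal sketch of that setting via the cluster settings API, using Python and a placeholder host; not our exact config.)

    import requests

    ES = "http://localhost:9200"  # placeholder; point this at your cluster

    # node.attr.rack_id / node.attr.box_type live in each node's elasticsearch.yml;
    # this cluster-wide setting tells the allocator to spread shard copies across
    # the rack_id values.
    resp = requests.put(
        f"{ES}/_cluster/settings",
        json={"persistent": {"cluster.routing.allocation.awareness.attributes": "rack_id"}},
    )
    resp.raise_for_status()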

What I don't understand is why the two red-highlighted nodes below have such a free-space disparity vs. the other four. Those two are actually above the first disk watermark at the moment and are no longer accepting new shard allocations, while the other four are still well below that threshold.

Does anyone have any insights to share on where this behavior is coming from?

This is the expected behaviour. Elasticsearch will permit a node's disk usage to get up to the high watermark before it starts to take evasive action. Can you explain more clearly why it's causing you a problem here?
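(For reference, the thresholds involved are the disk-based allocation watermarks, which default to 85% for the low watermark, 90% for the high watermark and 95% for the flood stage. A quick way to see each node's shard count and disk usage against them is _cat/allocation; a sketch in Python with a placeholder host:)

    import requests

    ES = "http://localhost:9200"  # placeholder

    # _cat/allocation reports shard count and disk usage per data node, which is
    # what the low/high/flood-stage watermarks are evaluated against.
    rows = requests.get(
        f"{ES}/_cat/allocation", params={"format": "json", "bytes": "b"}
    ).json()
    for row in rows:
        if row.get("disk.percent") is not None:  # skip the UNASSIGNED pseudo-row
            print(row["node"], row["shards"], f'{row["disk.percent"]}% used', row["disk.avail"])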

Your shard counts and heap usage look to be very high across the board. Here is an article about shard sizing that may help you configure your cluster more efficiently:

Can you explain more clearly why it's causing you a problem here?

I guess my concern is that if one of the top four nodes were to fail, the bottom two already being maxed out means we would be able to recover fewer shards than we would if the data were more evenly balanced.

Additionally, there are a few indices that have 7 replicas and are not pinned to hot/warm or rack_1/rack_2 (searchguard). Occasionally something (what exactly, I don't know) happens to one of those replicas on the two red nodes, and it can't be reallocated because those nodes have hit the high watermark. It's annoying more than anything else, as it leaves the cluster YELLOW and we get paged.

Your shard counts and heap usage look to be very high across the board. Here is an article about shard sizing that may help you configure your cluster more efficiently:

Yeah, we're acutely aware. It stems from ignorance about sharding and index creation/management when this cluster was set up. We're actively working to reduce the shard count and looking at transitioning to the built-in Index Lifecycle Management features, rather than letting Logstash determine index names and counting on Curator to move things around for us at fixed times. The numbers you see are actually a substantial improvement over where they were; I think we were close to 1,800 shards per node at one point.
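(Something like this is the sort of ILM policy we're sketching out: roll over in the hot phase, then move indices onto box_type=warm nodes, then delete. Policy name, thresholds and ages below are all made up for illustration, and the host is a placeholder.)

    import requests

    ES = "http://localhost:9200"  # placeholder

    # Hypothetical hot/warm lifecycle: roll over on size/age, relocate to warm
    # nodes after a few days, delete after a couple of months.
    policy = {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {"rollover": {"max_size": "50gb", "max_age": "7d"}}
                },
                "warm": {
                    "min_age": "3d",
                    "actions": {
                        "allocate": {
                            "require": {"box_type": "warm"},
                            "number_of_replicas": 1,
                        }
                    },
                },
                "delete": {"min_age": "60d", "actions": {"delete": {}}},
            }
        }
    }
    requests.put(f"{ES}/_ilm/policy/logs-policy", json=policy).raise_for_status()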

Thanks!

I've read about orphaned space in ES storage, and I wonder if that might be happening here. I think you could use _cat/shards to get a full inventory, but you would probably have to write a script to sum the space ES thinks is in use on these 2 nodes and compare it with what the filesystem thinks is free.
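(Something along these lines is what I had in mind; a rough Python sketch that sums the per-shard store size _cat/shards reports for each node, which you could then compare against df on the two suspect nodes. The host is a placeholder.)

    from collections import defaultdict
    import requests

    ES = "http://localhost:9200"  # placeholder

    # Sum the on-disk store size Elasticsearch reports for every started shard, per node.
    shards = requests.get(
        f"{ES}/_cat/shards",
        params={"format": "json", "bytes": "b", "h": "node,state,store"},
    ).json()

    per_node = defaultdict(int)
    for s in shards:
        if s["state"] == "STARTED" and s["store"] is not None:
            per_node[s["node"]] += int(s["store"])

    for node, used in sorted(per_node.items()):
        print(f"{node}: {used / 1024**4:.2f} TiB according to ES")
    # Compare these totals with `df` on each node to spot orphaned space.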

Are these 2 nodes in the same or different racks?

I think ES balances nodes by shard count, not space. Our nodes run about a 10% space difference when the shard counts match. I think it's unlikely that these 2 nodes just happen to have larger shards, but I guess it's possible.
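(You can dump the balancer's weight factors from the cluster settings to see this; they are expressed in terms of per-node and per-index shard counts rather than bytes. A sketch, placeholder host as before:)

    import json
    import requests

    ES = "http://localhost:9200"  # placeholder

    # Show the balance-related allocation settings, including unset defaults.
    resp = requests.get(
        f"{ES}/_cluster/settings",
        params={
            "include_defaults": "true",
            "filter_path": "*.cluster.routing.allocation.balance*",
        },
    )
    print(json.dumps(resp.json(), indent=2))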

I don't follow. If you lost one of the nodes with lower disk usage then you're right that there'd be less free space left in the remaining cluster than there would be in one with perfectly even disk usage, but you would also have lost correspondingly less data. These effects balance out, don't they?

FWIW I think you would struggle to recover completely from the loss of a node no matter how your shards are currently allocated. You have 5.5TB of space on each of 6 nodes which adds up to 33TB in total. The stats above indicate you have ~7.25TB of free space, so 33-7.25=25.75TB of data on the warm nodes. If you lost a warm node you would have 5.5*5=27.5TB of total disk remaining, so 24.75TB of space below the default high watermark of 90%, which would not fit your 25.75TB of data however you divide it up.
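(The same arithmetic spelled out, using the numbers quoted above:)

    # Capacity arithmetic from the post above (all figures in TB).
    total_warm = 6 * 5.5                 # 33.0 TB of raw warm capacity
    free_warm = 7.25                     # free space per the stats above
    data_warm = total_warm - free_warm   # 25.75 TB of data on the warm nodes

    after_loss = 5 * 5.5                 # 27.5 TB if one warm node is lost
    usable = after_loss * 0.90           # 24.75 TB below the default 90% high watermark

    print(data_warm, usable)             # 25.75 > 24.75, so the data no longer fits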

The "something" is almost certainly described in the logs. It may be cryptic, but if you share the logs here perhaps we can help work out what's causing that. Do you really need 7 replicas of some indices? Why?

I think the free space information reported here is coming directly from the filesystem.

In this case the emptiest of the nodes has ~4TB of data and the fullest has ~4.7TB, a difference of ~17%. That doesn't seem too surprising, particularly if shards have quite a spread of sizes. I've certainly seen larger disparities.
