We have a 27-data-node ES 5.5 cluster with 1.7 TB of disk per node. The ES team has created an amazingly easy-to-use, high-performance system, and our cluster normally works beautifully, ingesting over a billion logs per day.
We're trying to run the cluster with less free space than we've maintained in the past (all nodes currently have between 13% and 15% free) and have run into a problem.
Occasionally one node becomes imbalanced with a few large, active shards and its free space drops under 5%. That triggers disk high-watermark shard relocations. ES schedules too many moves off this node, causing more high-watermark relocations on other nodes, and the situation can cascade out of control like a runaway reactor.
Our shard sizes vary widely: some indices have shards under 0.1 GB, others around 40 GB. 2672 active shards use 40 TB of disk. To get reasonable allocation for such diverse shard sizes, we set cluster.routing.allocation.balance.shard to 0 so that each index's shards are spread evenly across nodes. (This also greatly reduced hot spots in the cluster.)
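For reference, the weighting change looks roughly like the request below, applied via the cluster settings API (the persistent scope is just how we'd apply it, and the index weight shown is the value we mention later in this thread):

```
# Illustrative sketch of our allocator weighting change
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.balance.shard": 0.0,
    "cluster.routing.allocation.balance.index": 1.0
  }
}
```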
We'd like to have ES perform the minimal set of relocations necessary to bring a node's free space back above the high-watermark threshold.
We'd also like a variant of BalancedShardsAllocator that takes free disk space into account, so that shards are preferentially allocated to nodes with more free disk.
When a node runs out of disk, it would be great to have a small number of shards move from it to the nodes with the most free disk.
Has anyone else seen this behavior? Are there any cluster configurations that would help eliminate these relocation storms?
Does anyone know of work being done to make balancing more disk aware? Are there ES versions that have better behavior than 5.5?
I'm pinging @dakrone as, IIRC, he has worked a lot on this topic.
I know there have been a lot of improvements in the 6.x series, but I'm unsure about your specific questions.
There haven't been many improvements to this area for single-path data configurations since 5.5. I have a couple of questions for you though:
What number of concurrent relocations are you using?
You said you're running at 13-15% free disk; what are your high and low watermarks configured at? (You said high was at 5%, but what is the low watermark?)
We have set node_concurrent_recoveries higher in the past (it was helpful for recovering from dead data nodes), but I have it set to 1 now because it seems to reduce the size of a rebalancing storm and will hopefully give us time to intervene.
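Concretely, that's something along these lines (shown here as a transient cluster setting; the scope is just how I'd apply it):

```
# Limit each node to a single concurrent recovery
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 1
  }
}
```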
The cluster runs in a steady state, with the volume of new ingest matching that of the old indices that are expiring. Sometimes we get unlucky and shards from several of our large indices are placed on a single node, pushing it into the high watermark. This has happened twice "in the wild", and I reproduced it twice more by adjusting the watermarks.
When a node hits the high watermark, the first set of shard relocations frees up hundreds of gigabytes of space. I haven't figured out why so many shards move at once.
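For the record, reproducing it looked something like the request below: temporarily tighten the high watermark so a nearly full node trips it, then restore the default afterwards (watermarks are expressed as used-disk percentages; the 85% value is only an illustration).

```
# Illustrative only: tighten the high watermark to trip a nearly-full node
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.high": "85%"
  }
}
```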
Thanks for looking at this. I really appreciate it.
In your case, when you say that large indices are placed on a single node, is this when the indices are newly created, and then they grow to a size that passes the high watermark?
Could you also attach the output of /_cat/allocation?v and /_cat/shards?v, please (assuming it's not too sensitive)?
The system is just about to expire old indices and allocate new daily indices, so this is the low point of a steady-state oscillation. Four nodes are down to around 10% free and the system as a whole is at 12% free. That's a bit lower than usual, but it will probably (hopefully!) bounce back.
Shard allocation always honors the low watermark, and with the index weighting, shards are always spread across the maximum number of nodes. This usually works well.
We have several daily indices that are larger than the others. If one node is unlucky enough to get shards from all of the large indices and its expiring shards don't free up enough space, it can hit the high watermark by the end of the day. In that case we'll usually see one node at 5% free and another at 20%. (I say usually, but it has only happened in the wild twice!)
When the high watermark is hit on a single node, I don't see just one recovery happening; it's more like one shard per node. When I modified the high watermark to induce this problem, 22 shards started relocating. I didn't verify all of them, but most were replicas.
I'm going to try to reproduce the problem on a smaller cluster. I don't think it's a function of data volume. Hopefully that will give you more information on whether this is a bug or just a system limitation.
I am going to preallocate the daily indices ahead of time and move any poorly placed shards. That won't work for everyone, but we have very predictable shard sizes, so automation can decide in advance whether a placement is poor. Moving an empty shard is very fast.
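Roughly what the automation will do, sketched with hypothetical index and node names (the reroute only makes sense while the shard is still empty):

```
# Create tomorrow's index ahead of time so its shards are placed while empty
# (index name and shard counts below are hypothetical)
PUT logs-2018.03.14
{
  "settings": {
    "number_of_shards": 27,
    "number_of_replicas": 1
  }
}

# If an empty shard landed on a node that is already tight on disk, move it
POST _cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "logs-2018.03.14",
        "shard": 0,
        "from_node": "node-low-on-disk",
        "to_node": "node-with-most-free-disk"
      }
    }
  ]
}
```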
This "just" leaves us with optimal disaster recovery shard placement. Hopefully when that happens we'll be on the version of Elasticsearch that has best-fit shard allocation taking free disk into account!
I've linked this thread as a +1 for #17213 since I think that idea might help here.
In the meantime, I have a hunch that if you can do the expiry of the old (large) indices before creating the new (large) ones, with rebalancing temporarily disabled, then the new ones will be allocated in the spaces that the old ones have left, based entirely on shard counts. Could you try that?
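Something along these lines, for example (re-enabling rebalancing once the new indices have been allocated):

```
# Temporarily stop rebalancing while swapping the daily indices over
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.rebalance.enable": "none"
  }
}

# ... delete the expiring indices, then create the new ones ...

# Switch rebalancing back on afterwards
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.rebalance.enable": "all"
  }
}
```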
There's also the index.routing.allocation.total_shards_per_node setting, which can limit the number of shards of each index allocated to each node. It affects all indices and is a hard limit, so it might sometimes leave you with a yellow cluster, but it might help.
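For example, it can be set dynamically on existing indices; the index pattern and the limit of 3 below are just placeholders:

```
# Hard-limit how many shards of each matching index may live on any one node
PUT logs-*/_settings
{
  "index.routing.allocation.total_shards_per_node": 3
}
```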
We switched to cluster.routing.allocation.balance.index: 1.0 a couple of months ago to solve node hot-spot issues. The indices that are currently expiring don't satisfy the flat index-balance requirement, but as soon as they are gone I will try doing the deletion immediately before new index creation.
When we upgrade to 6.x, it may be best to split the cluster up. The mixed workload with varying index sizes seems to cause more tuning trouble than running multiple clusters, each with a consistent workload.
D'oh! I was mistaken in thinking the old indices were not rebalanced when we made the weighting change to cluster.routing.allocation.balance. Everything is flat across all nodes, so I can do that delete experiment today.