A maintenance operation applied by "System" today restarted a node in our production cluster.
The restarted node began reallocating shards but was still not receiving traffic after 2 hours. Meanwhile, the remaining 2 nodes were under heavy load, which caused interruptions and timeouts.
The shard reallocation looked wrong: it assigned shards only to the 2 nodes that were already very high on memory usage and left the third node empty. Is this a bug?
We had to trigger _cluster/reroute manually to rebalance the shards. Is there a way to make sure shard reallocation is always balanced?
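For anyone trying to spot the same imbalance, here is a minimal sketch of how it can be checked. It assumes the rows come from GET _cat/shards?format=json (each row having "node" and "state" fields). The helper names and the sample data are hypothetical and are not taken from our cluster.

```python
def shards_per_node(rows, data_nodes):
    """Count STARTED shards per node, keeping nodes that hold none.

    rows: list of dicts shaped like the JSON output of _cat/shards.
    data_nodes: names of all data nodes expected in the cluster.
    """
    counts = {node: 0 for node in data_nodes}
    for row in rows:
        # Skip UNASSIGNED, INITIALIZING and RELOCATING shards, and any
        # node name that is not in the expected list.
        if row.get("state") == "STARTED" and row.get("node") in counts:
            counts[row["node"]] += 1
    return counts


def empty_nodes(counts):
    """Return the sorted names of nodes that hold no started shards."""
    return sorted(node for node, n in counts.items() if n == 0)


# Hypothetical sample mimicking the situation described above:
# two nodes hold every shard and the third one is empty.
sample_rows = [
    {"index": "logs", "shard": "0", "prirep": "p", "state": "STARTED", "node": "node-1"},
    {"index": "logs", "shard": "0", "prirep": "r", "state": "STARTED", "node": "node-2"},
    {"index": "logs", "shard": "1", "prirep": "p", "state": "STARTED", "node": "node-2"},
    {"index": "logs", "shard": "1", "prirep": "r", "state": "STARTED", "node": "node-1"},
    {"index": "logs", "shard": "2", "prirep": "p", "state": "UNASSIGNED", "node": None},
]

if __name__ == "__main__":
    counts = shards_per_node(sample_rows, ["node-1", "node-2", "node-3"])
    print(counts)
    print("nodes without shards:", empty_nodes(counts))
```

This only reports the distribution; it does not move any shards.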