Large-scale Elasticsearch: updating cluster settings is very slow

I have an Elasticsearch cluster with 3PB of data across 100+ nodes, but when I change a cluster setting, for example closing an index, it takes ten or twenty minutes. After analyzing hot_threads, I think the cause is rebalancing. I have set cluster.routing.rebalance.enable = none, but ClusterRebalanceAllocationDecider:canRebalance still executes. I think this part can be optimized.
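For reference, here is a minimal sketch of the two operations described above (disabling rebalancing and capturing hot_threads while a slow settings change is in flight). It assumes the Java High Level REST Client on a recent 6.x release; the host name and the standalone class are placeholders, not taken from this thread.

import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.action.admin.cluster.settings.ClusterUpdateSettingsRequest;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.settings.Settings;

public class RebalanceDiagnostics {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            // Disable shard rebalancing cluster-wide.
            ClusterUpdateSettingsRequest settings = new ClusterUpdateSettingsRequest();
            settings.transientSettings(Settings.builder()
                    .put("cluster.routing.rebalance.enable", "none"));
            client.cluster().putSettings(settings, RequestOptions.DEFAULT);

            // Capture hot threads while a slow operation (e.g. closing an index) is in progress.
            Response hotThreads = client.getLowLevelClient()
                    .performRequest(new Request("GET", "/_nodes/hot_threads"));
            System.out.println(EntityUtils.toString(hotThreads.getEntity()));
        }
    }
}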

What is your Elasticsearch version?

Elasticsearch 6.4. I have 140,000+ shards with hot-cold data. I think this method should check at its outermost layer: if cluster.routing.rebalance.enable = none, then return Decision.NO directly, because I don't want it to do a rebalance operation at all. Thanks.

@Override
public Decision canRebalance(RoutingAllocation allocation) {
    if (type == ClusterRebalanceType.INDICES_PRIMARIES_ACTIVE) {
        // check if there are unassigned primaries.
        if ( allocation.routingNodes().hasUnassignedPrimaries() ) {
            return allocation.decision(Decision.NO, NAME,
                    "the cluster has unassigned primary shards and cluster setting [%s] is set to [%s]",
                    CLUSTER_ROUTING_ALLOCATION_ALLOW_REBALANCE, type);
        }
        // check if there are initializing primaries that don't have a relocatingNodeId entry.
        if ( allocation.routingNodes().hasInactivePrimaries() ) {
            return allocation.decision(Decision.NO, NAME,
                    "the cluster has inactive primary shards and cluster setting [%s] is set to [%s]",
                    CLUSTER_ROUTING_ALLOCATION_ALLOW_REBALANCE, type);
        }

        return allocation.decision(Decision.YES, NAME, "all primary shards are active");
    }
    if (type == ClusterRebalanceType.INDICES_ALL_ACTIVE) {
        // check if there are unassigned shards.
        if (allocation.routingNodes().hasUnassignedShards() ) {
            return allocation.decision(Decision.NO, NAME,
                    "the cluster has unassigned shards and cluster setting [%s] is set to [%s]",
                    CLUSTER_ROUTING_ALLOCATION_ALLOW_REBALANCE, type);
        }
        // in case all indices are assigned, are there initializing shards which
        // are not relocating?
        if ( allocation.routingNodes().hasInactiveShards() ) {
            return allocation.decision(Decision.NO, NAME,
                    "the cluster has inactive shards and cluster setting [%s] is set to [%s]",
                    CLUSTER_ROUTING_ALLOCATION_ALLOW_REBALANCE, type);
        }
    }
    // type == Type.ALWAYS
    return allocation.decision(Decision.YES, NAME, "all shards are active");
}

I don't know this specific part of the code or how it has been improved over the last year, but @DavidTurner might be able to tell.

Just a note about the number of shards: that looks like it is above the recommendation of at most 20 shards per GB of heap. Unless you defined 70GB of heap?
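(For reference, assuming shards are spread evenly: 140,000 shards across 100+ nodes is roughly 1,400 shards per node, and at 20 shards per GB of heap that works out to about 70GB of heap per node.)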

Also, do you have dedicated master nodes? I suppose you do.

That is indeed a very large number of shards for a single cluster to handle and will result in a large cluster state that can take a while to update. For these data volumes I would recommend having fewer and larger shards. It is also possible that you have reached the point where using a single large cluster is no longer practical and splitting it up into several smaller clusters combined with cross-cluster search might be a good idea. If it takes this long for a planned change I wonder what would happen if you have a few node failures and actually need to rebalance.
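To illustrate the cross-cluster search suggestion, here is a hedged sketch; the remote alias, seed host, and index patterns are placeholders, and the cluster.remote setting key assumes Elasticsearch 6.5 or later rather than anything stated in this thread.

import org.elasticsearch.action.admin.cluster.settings.ClusterUpdateSettingsRequest;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class CrossClusterSearchSketch {
    static SearchResponse searchBothClusters(RestHighLevelClient client) throws Exception {
        // Register a remote (e.g. cold-data) cluster on the local cluster.
        ClusterUpdateSettingsRequest remote = new ClusterUpdateSettingsRequest();
        remote.persistentSettings(Settings.builder()
                .putList("cluster.remote.cold_cluster.seeds", "cold-master-1:9300"));
        client.cluster().putSettings(remote, RequestOptions.DEFAULT);

        // Query local indices and the remote cluster's indices in one request.
        SearchRequest search = new SearchRequest("logs-*", "cold_cluster:logs-*");
        search.source(new SearchSourceBuilder().query(QueryBuilders.matchAllQuery()).size(10));
        return client.search(search, RequestOptions.DEFAULT);
    }
}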

Yeah, I defined more than 70GB of heap on the cold nodes, and I have dedicated master nodes and dedicated coordinating nodes. I also did some testing: for the type == ClusterRebalanceType.INDICES_ALL_ACTIVE case, I deliberately created an unassigned shard so that it returned Decision.NO directly, and the execution was much faster.

I have made some changes toward having fewer and larger shards. I also have cluster-level monitoring, so if a node goes down it can be found and restarted in time without much impact, because most of the data is cold data.

I think changing ClusterRebalanceAllocationDecider:canRebalance as follows would be better. It has no effect when you actually need to rebalance, and it will speed up cluster updates:

@Override
public Decision canRebalance(RoutingAllocation allocation) {
    // rebalanceEnabled would hold the value of cluster.routing.rebalance.enable
    if (rebalanceEnabled == Rebalance.NONE) {
        return allocation.decision(Decision.NO, NAME,
                "rebalancing is disabled by [cluster.routing.rebalance.enable=none]");
    }
    // ... otherwise run the original code block shown above ...
}

I was just looking at a support case about a very similar cluster (≥3PB of data, 170+ nodes, 140,000 shards). Perhaps it's the same one. The real solution is to split this cluster up. The cluster state for the other cluster I was investigating was well over 100MB, and there are likely some other bottlenecks as well as the one mentioned here.

Nonetheless, yes, it turns out that even when cluster.routing.rebalance.enable: none is set, Elasticsearch still does quite a lot of computation before discovering that rebalancing is impossible when there are so many shards to consider. I opened https://github.com/elastic/elasticsearch/pull/40942, which should improve this particular situation. I cannot guarantee that this is the only issue facing such a large cluster, or that it will be possible to fix the next problem encountered.


Thanks a lot. Is the method I proposed feasible? And when will this fix be merged to the master branch and released? Will it be available in version 7.0? I look forward to your reply.

Not quite, but it's close. Firstly, ClusterRebalanceAllocationDecider is the wrong place to do this; it belongs in EnableAllocationDecider. Secondly, you also need to account for the per-index setting index.routing.rebalance.enable, because it overrides the cluster-wide setting.
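Purely as an illustration, a short-circuit in EnableAllocationDecider that still honours the per-index override might look roughly like the sketch below. The field and setting names are assumptions about that class, and this is not the code in the actual change.

@Override
public Decision canRebalance(RoutingAllocation allocation) {
    if (enableRebalance == Rebalance.NONE && allocation.ignoreDisable() == false) {
        // index.routing.rebalance.enable overrides the cluster-wide setting, so only
        // short-circuit when no index defines its own rebalance setting.
        for (IndexMetaData indexMetaData : allocation.metaData()) {
            if (INDEX_ROUTING_REBALANCE_ENABLE_SETTING.exists(indexMetaData.getSettings())) {
                return allocation.decision(Decision.YES, NAME,
                        "rebalancing may be enabled for individual indices");
            }
        }
        return allocation.decision(Decision.NO, NAME,
                "rebalancing is disabled by [cluster.routing.rebalance.enable=none]");
    }
    return allocation.decision(Decision.YES, NAME, "rebalancing is not disabled cluster-wide");
}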

I merged the PR into master yesterday, but it contained a mistake and I had to revert it. I opened a fixed PR (Short-circuit rebalancing when disabled by DaveCTurner · Pull Request #40966 · elastic/elasticsearch · GitHub) and have already merged it to master. As you can see from the labels on that PR, I intend to backport this to the 6.7 branch; we cannot be certain exactly which versions will include this fix until they are released, but it will be listed in the release notes when it is included.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.