Unbalanced shards

I have a cluster of over 25 nodes, and the indices are balanced across all nodes except one. Let's call it the problematic node. It has fewer indices and shards than the other nodes in the cluster, and there are no unassigned shards in the cluster.

I have checked

  1. Node settings: same as the other nodes.
  2. Settings and mappings on the indices that are missing from the problematic node: its IP is not excluded.
  3. ES logs: nothing relevant that would point to the cause.
  4. Allocation explain (GET /_cluster/allocation/explain): it seems to return something only when there are unassigned shards.
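
For reference, the allocation explain API can also be asked about a shard that is already assigned by sending a request body that names the shard. Without a body it looks for an unassigned shard, which would explain why it has nothing to report on a cluster with none. Below is a minimal Python sketch that only builds such a body. The index name `my-index` and node name `node-7` are hypothetical placeholders, and sending the request is left to whatever HTTP client you use.

```python
import json


def explain_body(index, shard, primary, current_node=None):
    """Build a request body for GET /_cluster/allocation/explain.

    index, shard and primary identify the shard copy to explain;
    current_node optionally narrows it to the copy held by one node.
    """
    body = {"index": index, "shard": shard, "primary": primary}
    if current_node is not None:
        body["current_node"] = current_node
    return body


# Hypothetical names, for illustration only.
print(json.dumps(explain_body("my-index", 0, True, "node-7"), indent=2))
```

Running this for a shard that sits on a well-populated node should show the allocator's view on whether that shard may move or rebalance to the problematic node.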

I also observed constant shard relocation on the problematic node; its shard count keeps fluctuating.
I don't see a reason why the shard allocator would put fewer shards on this node. Can someone please shed some light on this?


What do the master logs that mention that node show?
What do the actual logs on that node show?

What's the output from _cat/allocation?v?
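
If it helps to compare the nodes at a glance, here is a rough Python sketch that flags nodes whose shard count is far from the median. It assumes the rows come from `_cat/allocation?format=json`, where each row carries `node` and `shards` as strings. The sample rows and the 20% tolerance are made up for illustration.

```python
from statistics import median


def shard_outliers(rows, tolerance=0.2):
    """Return (node, shards) pairs more than `tolerance` away from the median.

    rows: list of dicts shaped like the _cat/allocation JSON output.
    """
    # Skip the pseudo-row that reports unassigned shards, if present.
    counts = {r["node"]: int(r["shards"]) for r in rows if r["node"] != "UNASSIGNED"}
    mid = median(counts.values())
    return [(n, c) for n, c in sorted(counts.items()) if abs(c - mid) > tolerance * mid]


# Hypothetical sample data, not taken from the cluster in this thread.
sample = [
    {"node": "node-1", "shards": "120"},
    {"node": "node-2", "shards": "118"},
    {"node": "node-3", "shards": "121"},
    {"node": "node-4", "shards": "64"},
]
print(shard_outliers(sample))
```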

I went through the master logs and the problematic node's logs and didn't find anything related to the issue. The node looks like the other nodes.

The _cat/allocation output shows less data on the node.

I have tried replacing the node with a new one and see the same behavior.

Is the storage the same on every node?

Yes, the storage is the same as on the other nodes. The node configuration (heap, disk, machine, CPU) and the ES settings are also the same as on the other nodes.

Can you share the output of:
GET _cluster/settings

We need to look at the routing settings in the cluster.
It would also be good to verify that the elasticsearch.yml file is consistent across all nodes.

Thanks Dinesh, these are my routing settings:

    "cluster.routing.allocation.allow_rebalance" : "indices_all_active",
    "cluster.routing.allocation.awareness.attributes" : [ ],
    "cluster.routing.allocation.balance.index" : "0.55",
    "cluster.routing.allocation.balance.shard" : "0.45",
    "cluster.routing.allocation.balance.threshold" : "1.0",
    "cluster.routing.allocation.disk.include_relocations" : "true",
    "cluster.routing.allocation.disk.reroute_interval" : "60s",
    "cluster.routing.allocation.disk.threshold_enabled" : "true",
    "cluster.routing.allocation.disk.watermark.enable_for_single_data_node" : "false",
    "cluster.routing.allocation.enable" : "all",
    "cluster.routing.allocation.node_concurrent_incoming_recoveries" : "2",
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries" : "2",
    "cluster.routing.allocation.node_initial_primaries_recoveries" : "4",
    "cluster.routing.allocation.same_shard.host" : "false",
    "cluster.routing.allocation.shard_state.reroute.priority" : "NORMAL",
    "cluster.routing.allocation.total_shards_per_node" : "-1",
    "cluster.routing.allocation.type" : "balanced",
    "cluster.routing.rebalance.enable" : "all",
    "cluster.routing.use_adaptive_replica_selection" : "true",

We spin up the clusters with a pipeline, so every node gets the same settings as the others. I verified this, and they are identical.
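
To reason about what the balance settings listed above imply, here is a simplified Python model of how, as far as I understand it, the balanced allocator combines `balance.shard` and `balance.index` into a per-node weight. It is a sketch, not the actual implementation: it ignores deciders such as disk watermarks and any extra factors a given version may apply, and the shard counts in the example are invented.

```python
def node_weight(node_shards, node_index_shards, avg_shards, avg_index_shards,
                index_balance=0.55, shard_balance=0.45):
    """Simplified weight of a node with respect to one index.

    node_shards:       total shards on the node
    node_index_shards: shards of the given index on the node
    avg_shards:        average total shards per node
    avg_index_shards:  average shards of the given index per node
    """
    total = index_balance + shard_balance
    theta_shard = shard_balance / total   # share given to overall shard count
    theta_index = index_balance / total   # share given to per-index spread
    return (theta_shard * (node_shards - avg_shards)
            + theta_index * (node_index_shards - avg_index_shards))


# Invented numbers: one under-filled node and one typical node.
low = node_weight(64, 1, 115, 1.0)
typical = node_weight(120, 1, 115, 1.0)
print(low, typical, typical - low)
```

In this model the gap between the two weights is far above `balance.threshold` (1.0), so with these settings alone the allocator would be expected to keep moving shards toward the under-filled node. That fits the constant relocation described earlier and suggests looking at what prevents shards from staying there rather than at the balance factors.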
