Large shards are not moving from old ES version nodes to new ES version nodes

I have a 6.8 cluster that is being upgraded to 7.6. Shard relocation from the 6.8 data nodes to the 7.6 data nodes has completed for all of the smaller shards, but the large shards of some indices are not being relocated to the 7.6 nodes and we are seeing a NodeNotConnectedException. The cluster shows that both the old and new nodes have joined, and, as mentioned, the smaller shards have already moved to the new nodes.
Please provide your inputs on how to handle this situation.
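For reference, this is roughly how I am checking which shards are still on the 6.8 nodes and which relocations are in flight (only a sketch; it assumes the HTTP interface is reachable on localhost:9200 with no security in the way):

    # Which node holds each shard copy, and its state (STARTED / RELOCATING / UNASSIGNED)
    curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,store,node'

    # Recoveries that are currently running, including the peer recoveries used for relocation
    curl -s 'http://localhost:9200/_cat/recovery?v&active_only=true'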

You didn't share any logs yet so it's hard to be certain, but if you are seeing shard movements fail with a NodeNotConnectedException then that means what it says: the nodes are not connected to each other. Elasticsearch requires (reliable & stable) connectivity between all nodes in a cluster.
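If you want a quick way to rule out a basic network problem, you could check from each node that every other node's transport port (9300 by default) is reachable. A rough sketch, with placeholder hostnames, just to prove the TCP path between nodes works:

    # From each node, confirm the other nodes' transport port accepts TCP connections
    nc -vz other-node.example.local 9300

    # Sending an HTTP request to the transport port normally gets a short
    # "this is not an HTTP port" style response, which still shows the connection was made
    curl -s http://other-node.example.local:9300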

Hi @DavidTurner, thank you for the response.

  • Both the old and new data nodes have the master nodes' IP addresses in discovery.zen.ping.unicast.hosts in elasticsearch.yml.
    Also, when I check _cat/nodes, I see both the old and new nodes listed. How else can I check whether the nodes are connected to each other?
  • The logs on both the new and old nodes do not contain any details beyond the NodeNotConnectedException.
    This log is from a new node (7.6):
   org.elasticsearch.transport.NodeNotConnectedException: [xxxxx.xxxx.local][xx.xx.xx.xx:9300] Node not connected
        at org.elasticsearch.transport.ConnectionManager.getConnection(ConnectionManager.java:191) ~[elasticsearch-7.6.2.jar:7.6.2]
        at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:637) ~[elasticsearch-7.6.2.jar:7.6.2]
        at org.elasticsearch.action.search.SearchTransportService.getConnection(SearchTransportService.java:393) ~[elasticsearch-7.6.2.jar:7.6.2]
        at org.elasticsearch.action.search.TransportSearchAction.lambda$buildConnectionLookup$6(TransportSearchAction.java:536) ~[elasticsearch-7.6.2.jar:7.6.2]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.getConnection(AbstractSearchAsyncAction.java:579) ~[elasticsearch-7.6.2.jar:7.6.2]
        at org.elasticsearch.action.search.SearchDfsQueryThenFetchAsyncAction.executePhaseOnShard(SearchDfsQueryThenFetchAsyncAction.java:56) ~[elasticsearch-7.6.2.jar:7.6.2]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$3(AbstractSearchAsyncAction.java:228) ~[elasticsearch-7.6.2.jar:7.6.2]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.performPhaseOnShard(AbstractSearchAsyncAction.java:263) ~[elasticsearch-7.6.2.jar:7.6.2]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:395) [elasticsearch-7.6.2.jar:7.6.2]...

That just means that the nodes are connected to the master, not to each other.
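If it helps, the node stats API exposes transport-level counters (rather than just cluster membership), which can give a rough indication of whether inter-node connections are being established. A sketch, again assuming an unsecured HTTP interface on localhost:9200:

    # Per-node transport stats; "server_open" counts open inbound transport connections,
    # and the rx/tx counters should be increasing on a node that is talking to its peers
    curl -s 'http://localhost:9200/_nodes/stats/transport?pretty'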

If they are reporting NodeNotConnectedException then they are not connected.

Thanks @DavidTurner. The logs do not have much detail. What could be the reason? Does this happen when the shard size is larger? What should I be checking to track down the problem?

No, I don't think this has anything to do with the sizes of the shards. There might be clues in the logs that you're missing; it's hard to say without seeing them. A NodeNotConnectedException indicates that the problem is to do with connectivity, so that's what you should check.
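One concrete place to look for clues is the allocation explain API, which reports why a particular shard copy is not being assigned or moved. A sketch with a placeholder index name, assuming the same unsecured localhost endpoint:

    # Explain the allocation of shard 0 of a specific index (replace the index name)
    curl -s -H 'Content-Type: application/json' \
      'http://localhost:9200/_cluster/allocation/explain?pretty' -d '
    {
      "index": "my-large-index",
      "shard": 0,
      "primary": true
    }'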

This is really weird, as we can see that some of our indices' shards have already moved to the new data nodes.
Is it possible that the nodes were discovering each other at first and then stopped after a certain time?

Yes, it's entirely possible that your network was working ok in the past but isn't any more. It's also possible that it's only sporadically broken and has been healthy for long enough at a time to recover the smaller shards. I don't think this changes anything: the only way to resolve a NodeNotConnectedException is to fix whatever connectivity issue is causing it.
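Also, if any of these failures left a shard copy unassigned after too many attempts, the allocator stops retrying it on its own, so once the connectivity problem is fixed you may need to ask it to retry. A sketch:

    # Retry allocations that were previously abandoned after too many failures
    curl -s -X POST 'http://localhost:9200/_cluster/reroute?retry_failed=true&pretty'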

To be clear, this is not a discovery problem: discovery is only concerned with finding (or electing) a master, and this is apparently working fine for you since GET /_cat/nodes returns the expected results.
