Large shards are not moving from old ES version nodes to new ES version nodes

163satish · January 22, 2021, 3:22am

I have a 6.8 cluster that is being upgraded to 7.6. Relocation of all the smaller shards from the 6.8 data nodes to the 7.6 data nodes has completed. The large shards of some indices are not being relocated to the 7.6 nodes, and we are seeing NodeNotConnectedException. The cluster shows that both the old and new nodes have already joined, and as I mentioned, the smaller shards have already moved to the new nodes.
Please advise on how to handle this situation.

DavidTurner · January 22, 2021, 7:35am

You didn't share any logs yet so it's hard to be certain, but if you are seeing shard movements fail with a NodeNotConnectedException then that means what it says: the nodes are not connected to each other. Elasticsearch requires (reliable & stable) connectivity between all nodes in a cluster.

163satish · January 22, 2021, 7:59am

Hi @DavidTurner, thank you for the response.

Both the old and new data nodes have the master nodes' IP addresses in discovery.zen.ping.unicast.hosts in elasticsearch.yml.
Also, when I check _cat/nodes, I see both the old and new nodes listed. How else can I check whether the nodes are connected to each other?
The logs on both the new and old nodes contain no details other than NodeNotConnectedException.
This log is from a new node (version 7.6):

   org.elasticsearch.transport.NodeNotConnectedException: [xxxxx.xxxx.local][xx.xx.xx.xx:9300] Node not connected
        at org.elasticsearch.transport.ConnectionManager.getConnection(ConnectionManager.java:191) ~[elasticsearch-7.6.2.jar:7.6.2]
        at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:637) ~[elasticsearch-7.6.2.jar:7.6.2]
        at org.elasticsearch.action.search.SearchTransportService.getConnection(SearchTransportService.java:393) ~[elasticsearch-7.6.2.jar:7.6.2]
        at org.elasticsearch.action.search.TransportSearchAction.lambda$buildConnectionLookup$6(TransportSearchAction.java:536) ~[elasticsearch-7.6.2.jar:7.6.2]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.getConnection(AbstractSearchAsyncAction.java:579) ~[elasticsearch-7.6.2.jar:7.6.2]
        at org.elasticsearch.action.search.SearchDfsQueryThenFetchAsyncAction.executePhaseOnShard(SearchDfsQueryThenFetchAsyncAction.java:56) ~[elasticsearch-7.6.2.jar:7.6.2]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$3(AbstractSearchAsyncAction.java:228) ~[elasticsearch-7.6.2.jar:7.6.2]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.performPhaseOnShard(AbstractSearchAsyncAction.java:263) ~[elasticsearch-7.6.2.jar:7.6.2]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:395) [elasticsearch-7.6.2.jar:7.6.2]...

DavidTurner · January 22, 2021, 8:32am

That just means that the nodes are connected to the master, not to each other.

If they are reporting NodeNotConnectedException then they are not connected.
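
To make "connected to each other" concrete: every node needs a working transport connection to every other node, not only to the master. The sketch below lists the node-to-node links that would have to work. It assumes you have saved the JSON from GET /_nodes/transport and that each node entry carries a "name" and a "transport_address" of the form host:port; the function names are hypothetical helpers, not part of any Elasticsearch client.

```python
from itertools import permutations


def transport_endpoints(nodes_info):
    """Map node name -> (host, port) from a nodes-info style response.

    Assumption: each entry under "nodes" has "name" and a
    "transport_address" string such as "10.0.0.1:9300" or "[::1]:9300".
    """
    endpoints = {}
    for node in nodes_info.get("nodes", {}).values():
        # Split on the last colon so IPv6 literals keep their inner colons.
        host, _, port = node["transport_address"].rpartition(":")
        endpoints[node["name"]] = (host.strip("[]"), int(port))
    return endpoints


def required_links(endpoints):
    """Return every ordered (source, target) pair of distinct node names."""
    return list(permutations(sorted(endpoints), 2))
```

For a cluster of n nodes this yields n * (n - 1) directed links, and a NodeNotConnectedException between two data nodes means at least one of those links is down even though both nodes can reach the master.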

163satish · January 22, 2021, 9:50am

Thanks @DavidTurner. The logs do not have much detail. What could be the reason? Does this happen when shards are larger? What should I check to find the problem?

DavidTurner · January 22, 2021, 10:18am

No, I don't think this is anything to do with the sizes of the shards. There might be clues in the logs that you're missing, it's hard to say without seeing them. A NodeNotConnectedException indicates that the problem is to do with connectivity, so that's what you should check.
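
A basic way to start checking connectivity is to try opening a TCP connection to the transport port of each peer, run from the node that logs the exception. This is only a minimal sketch: port 9300 is taken from the log above, the host list is whatever your cluster uses, and a successful connect shows only that the port is reachable at that moment, not that Elasticsearch's own connections are healthy.

```python
import socket


def can_connect(host, port=9300, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_peers(peers, port=9300, timeout=5.0):
    """Probe each peer host once and return {host: reachable}."""
    return {host: can_connect(host, port, timeout) for host in peers}
```

Running check_peers on each data node with the addresses of all the other nodes gives a quick picture of which directions fail. Any False entry points at a firewall, routing or address problem to investigate.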

163satish · January 22, 2021, 5:22pm

This is really weird, as we see that some of our indices' shards have already moved to the new data nodes.
Is it possible that after a certain time the nodes stopped discovering each other?

DavidTurner · January 22, 2021, 6:19pm

Yes, it's entirely possible that your network was working ok in the past but isn't any more. It's also possible that it's only sporadically broken and has been healthy for long enough at a time to recover the smaller shards. I don't think this changes anything: the only way to resolve a NodeNotConnectedException is to fix whatever connectivity issue is causing it.
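
Since the breakage may be sporadic, a single successful check proves little. One option is to probe the same peer repeatedly and look at how often it fails. The sketch below assumes the transport port 9300 from the log above; the function names, attempt count and interval are illustrative choices, not anything Elasticsearch provides.

```python
import socket
import time


def probe(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def sample_connectivity(host, port=9300, attempts=10, interval=1.0, timeout=5.0):
    """Probe host:port repeatedly and return a list of (unix_time, ok) tuples."""
    results = []
    for i in range(attempts):
        results.append((time.time(), probe(host, port, timeout)))
        if i < attempts - 1:
            time.sleep(interval)  # wait between probes, but not after the last
    return results


def failure_rate(results):
    """Fraction of failed probes in results (0.0 for an empty list)."""
    if not results:
        return 0.0
    return sum(1 for _, ok in results if not ok) / len(results)
```

Left running for as long as a large shard takes to relocate, a nonzero failure rate would be consistent with a network that stays healthy long enough for small shards but not for large ones.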

To be clear, this is not a discovery problem. Discovery is only concerned with finding (or electing) a master, and this is apparently working fine for you since GET /_cat/nodes returns the expected results.
