I have two nodes in my Elasticsearch cluster; both nodes have the same configuration and environment.
cat /etc/centos-release
CentOS Linux release 7.8.2003 (Core)
uname -r
3.10.0-1127.8.2.el7.x86_64
sysctl vm.max_map_count
vm.max_map_count = 262144
/etc/security/limits.conf
root soft nofile 65535
root hard nofile 65535
* soft nofile 65535
* hard nofile 65535
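(For completeness, limits.conf only applies to new sessions, so one way to confirm the limit actually took effect for the running process is to read /proc/<pid>/limits. The pgrep pattern below is an assumption based on the standard Elasticsearch main class; adjust it if your process looks different.)

```shell
# Sketch: confirm the nofile limit applied to the running Elasticsearch process.
# The pgrep pattern is an assumption (standard bootstrap class for a tar install).
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch | head -n1)
if [ -n "$ES_PID" ]; then
  grep 'Max open files' "/proc/$ES_PID/limits"
else
  echo "Elasticsearch process not found"
fi
```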
bin/elasticsearch -V
Version: 7.14.0, Build: default/tar/dd5a0a2acaa2045ff9624f3729fc8a6f40835aa1/2021-07-29T20:49:32.864135063Z, JVM: 16.0.1
My Elasticsearch cluster worked fine for many days. Suddenly I found that one node had shut down, with these messages in its logs:
[2021-12-20T16:35:59,333][INFO ][o.e.x.m.p.NativeController] [node-2] Native controller process has stopped - no new native processes can be started
[2021-12-20T16:35:59,335][INFO ][o.e.n.Node ] [node-2] stopping ...
[2021-12-20T16:35:59,340][INFO ][o.e.x.w.WatcherService ] [node-2] stopping watch service, reason [shutdown initiated]
[2021-12-20T16:35:59,340][INFO ][o.e.x.w.WatcherLifeCycleService] [node-2] watcher has stopped and shutdown
[2021-12-20T16:35:59,763][INFO ][o.e.i.s.GlobalCheckpointSyncAction] [node-2] [room-20211220][0] global checkpoint sync failed
org.elasticsearch.node.NodeClosedException: node closed {node-2}{8R75eOm0RMu3z-OMmsNwMw}{EwNsmybASQSOKav2i_5gqA}{10.205.205.106}{10.205.205.106:9300}{cdfhilmrstw}{ml.machine_memory=135027458048, xpack.installed=true, transform.node=true, ml.max_open_jobs=512, ml.max_jvm_size=32212254720}
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onClusterServiceClose(TransportReplicationAction.java:839) [elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onClusterServiceClose(ClusterStateObserver.java:317) [elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onClose(ClusterStateObserver.java:226) [elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.cluster.service.ClusterApplierService.addTimeoutListener(ClusterApplierService.java:252) [elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:165) [elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:109) [elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:101) [elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retry(TransportReplicationAction.java:831) [elasticsearch-7.14.0.jar:7.14.0]
At the same time, the other node's logs show that it could not connect to node-2:
[2021-12-20T16:35:59,761][INFO ][o.e.c.c.Coordinator ] [node-1] master node [{node-2}{8R75eOm0RMu3z-OMmsNwMw}{EwNsmybASQSOKav2i_5gqA}{10.205.205.106}{10.205.205.106:9300}{cdfhilmrstw}{ml.machine_memory=135027458048, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=32212254720, transform.node=true}] failed, restarting discovery
org.elasticsearch.transport.NodeDisconnectedException: [node-2][10.205.205.106:9300][disconnected] disconnected
[2021-12-20T16:35:59,769][INFO ][o.e.c.s.ClusterApplierService] [node-1] master node changed {previous [{node-2}{8R75eOm0RMu3z-OMmsNwMw}{EwNsmybASQSOKav2i_5gqA}{10.205.205.106}{10.205.205.106:9300}{cdfhilmrstw}], current []}, term: 15, version: 4296, reason: becoming candidate: onLeaderFailure
[2021-12-20T16:35:59,770][WARN ][o.e.c.NodeConnectionsService] [node-1] failed to connect to {node-2}{8R75eOm0RMu3z-OMmsNwMw}{EwNsmybASQSOKav2i_5gqA}{10.205.205.106}{10.205.205.106:9300}{cdfhilmrstw}{ml.machine_memory=135027458048, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=32212254720, transform.node=true} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [node-2][10.205.205.106:9300] connect_exception
at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:988) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$0(ActionListener.java:277) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.core.CompletableContext.lambda$addListener$0(CompletableContext.java:31) ~[elasticsearch-core-7.14.0.jar:7.14.0]
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?]
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2158) ~[?:?]
at org.elasticsearch.core.CompletableContext.completeExceptionally(CompletableContext.java:46) ~[elasticsearch-core-7.14.0.jar:7.14.0]
at org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$addListener$0(Netty4TcpChannel.java:57) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578) ~[?:?]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[?:?]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
at java.lang.Thread.run(Thread.java:831) [?:?]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 10.205.205.106/10.205.205.106:9300
Caused by: java.net.ConnectException: Connection refused
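Since node-1 got "Connection refused" at the exact moment node-2 logged "stopping ...", the process on node-2 was apparently terminated rather than crashing on its own. One thing I tried is checking whether something outside Elasticsearch killed it, e.g. the kernel OOM killer. This is only a sketch: the commands are standard CentOS 7 tools, but the journalctl unit name assumes a systemd-managed install, which may not match a tar install like mine.

```shell
# Look for OOM-killer activity in the kernel ring buffer around the crash time.
dmesg -T | grep -iE 'out of memory|oom-kill|killed process' || echo "no OOM messages in dmesg"
# If Elasticsearch happens to run under systemd (unit name is an assumption),
# the journal may record who stopped the service.
journalctl -u elasticsearch --since "2021-12-20 16:30" --until "2021-12-20 16:40" --no-pager 2>/dev/null || true
```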