Hi,
I have a two-node cluster (ES version 7.1) that was running absolutely fine until one of the nodes became unreachable. After bringing it back up and starting Elasticsearch, the node does not rejoin the cluster.
Please find below my elasticsearch.yml config:
cluster.name: oep_np
node.name: ausdlovpes01_1
node.attr.rack: dev
node.max_local_storage_nodes: 2
node.master: true
node.data: true
path.data: /u01/es/data/es_01
path.logs: /u01/es/logs/es_01
bootstrap.memory_lock: true
network.host: 10.179.192.121
http.port: 8080
transport.port: 8200
transport.publish_port: 8081
transport.profiles.default.port: 8081
discovery.seed_hosts: ["10.179.192.121:8081", "10.179.200.12:8081"]
cluster.initial_master_nodes: ["ausdlovpes01_1","ausilovpes01_1"]
gateway.recover_after_nodes: 2
cluster.routing.allocation.enable: none
cluster.routing.allocation.same_shard.host: true
xpack.security.enabled: false
logger.org.elasticsearch.cluster.coordination.ClusterBootstrapService: TRACE
logger.org.elasticsearch.discovery: TRACE
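Before digging into discovery, a quick reachability test of the transport publish ports from each server can rule out firewall or port-binding problems (a minimal sketch in Python, assuming plain TCP and no transport TLS):

# connectivity_check.py - test whether the transport publish ports listed in
# discovery.seed_hosts accept TCP connections (assumes no transport TLS).
import socket

SEED_HOSTS = [("10.179.192.121", 8081), ("10.179.200.12", 8081)]

for host, port in SEED_HOSTS:
    try:
        # create_connection raises OSError (e.g. "Connection refused") on failure
        with socket.create_connection((host, port), timeout=5):
            print(f"{host}:{port} reachable")
    except OSError as exc:
        print(f"{host}:{port} NOT reachable: {exc}")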
From the log trace of the master node it looks like the second node joins the cluster and then leaves it again within seconds, and I cannot figure out the reason. Please help me resolve the issue.
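To watch the flapping from the API side as well as from the logs, I poll _cat/nodes on the master (a minimal sketch, assuming the HTTP endpoint on 10.179.192.121:8080 from the config above and no authentication, since security is disabled):

# watch_nodes.py - poll _cat/nodes every few seconds to see the second node
# appear in and disappear from the cluster (assumes HTTP on port 8080 and
# xpack.security.enabled: false, i.e. no credentials needed).
import time
import urllib.request

CAT_NODES_URL = "http://10.179.192.121:8080/_cat/nodes?v"

while True:
    try:
        with urllib.request.urlopen(CAT_NODES_URL, timeout=10) as resp:
            print(time.strftime("%H:%M:%S"))
            print(resp.read().decode())
    except OSError as exc:
        print(time.strftime("%H:%M:%S"), "request failed:", exc)
    time.sleep(5)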
Please find below the logs of the master node (ausdlovpes01_1):
[2020-03-19T07:45:48,084][INFO ][o.e.n.Node ] [ausdlovpes01_1] starting ...
[2020-03-19T07:45:48,244][INFO ][o.e.t.TransportService ] [ausdlovpes01_1] publish_address {10.179.192.121:8081}, bound_addresses {10.179.192.121:8081}
[2020-03-19T07:45:48,252][INFO ][o.e.b.BootstrapChecks ] [ausdlovpes01_1] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2020-03-19T07:45:48,259][DEBUG][o.e.d.SeedHostsResolver ] [ausdlovpes01_1] using max_concurrent_resolvers [10], resolver timeout [5s]
[2020-03-19T07:45:48,260][INFO ][o.e.c.c.Coordinator ] [ausdlovpes01_1] cluster UUID [zOSPKsdfSamulWnZ0syk5Q]
[2020-03-19T07:45:48,264][TRACE][o.e.d.PeerFinder ] [ausdlovpes01_1] activating with nodes:
{ausdlovpes01_1}{2jifvTz5SeuUc1MljZua2g}{yYpuu9D7Q0q3QHprUnpPVQ}{10.179.192.121}{10.179.192.121:8081}{ml.machine_memory=8182054912, rack=dev, xpack.installed=true, ml.max_open_jobs=20}, local
[2020-03-19T07:45:48,266][TRACE][o.e.d.PeerFinder ] [ausdlovpes01_1] probing master nodes from cluster state: nodes:
{ausdlovpes01_1}{2jifvTz5SeuUc1MljZua2g}{yYpuu9D7Q0q3QHprUnpPVQ}{10.179.192.121}{10.179.192.121:8081}{ml.machine_memory=8182054912, rack=dev, xpack.installed=true, ml.max_open_jobs=20}, local
[2020-03-19T07:45:48,266][TRACE][o.e.d.PeerFinder ] [ausdlovpes01_1] startProbe(10.179.192.121:8081) not probing local node
[2020-03-19T07:45:48,287][TRACE][o.e.d.SeedHostsResolver ] [ausdlovpes01_1] resolved host [10.179.192.121:8081] to [10.179.192.121:8081]
[2020-03-19T07:45:48,288][TRACE][o.e.d.SeedHostsResolver ] [ausdlovpes01_1] resolved host [10.179.200.12:8081] to [10.179.200.12:8081]
[2020-03-19T07:45:48,290][TRACE][o.e.d.PeerFinder ] [ausdlovpes01_1] probing resolved transport addresses [10.179.200.12:8081]
[2020-03-19T07:45:48,291][TRACE][o.e.d.PeerFinder ] [ausdlovpes01_1] Peer{transportAddress=10.179.200.12:8081, discoveryNode=null, peersRequestInFlight=false} attempting connection
[2020-03-19T07:45:48,295][TRACE][o.e.d.HandshakingTransportAddressConnector] [ausdlovpes01_1] [connectToRemoteMasterNode[10.179.200.12:8081]] opening probe connection
[2020-03-19T07:45:48,331][DEBUG][o.e.d.PeerFinder ] [ausdlovpes01_1] Peer{transportAddress=10.179.200.12:8081, discoveryNode=null, peersRequestInFlight=false} connection failed
org.elasticsearch.transport.ConnectTransportException: [][10.179.200.12:8081] connect_exception
at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:1299) ~[elasticsearch-7.1.0.jar:7.1.0]
at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$2(ActionListener.java:99) ~[elasticsearch-7.1.0.jar:7.1.0]
at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:42) ~[elasticsearch-core-7.1.0.jar:7.1.0]
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?]
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[?:?]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2159) ~[?:?]
at org.elasticsearch.common.concurrent.CompletableContext.completeExceptionally(CompletableContext.java:57) ~[elasticsearch-core-7.1.0.jar:7.1.0]
at org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$new$1(Netty4TcpChannel.java:72) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424) ~[?:?]
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:327) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:343) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:556) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:510) ~[?:?]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470) ~[?:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909) ~[?:?]
at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.179.200.12:8081
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:779) ~[?:?]
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
... 6 more
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:779) ~[?:?]
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
... 6 more
[2020-03-19T07:45:48,387][TRACE][o.e.d.PeerFinder ] [ausdlovpes01_1] deactivating and setting leader to {ausdlovpes01_1}{2jifvTz5SeuUc1MljZua2g}{yYpuu9D7Q0q3QHprUnpPVQ}{10.179.192.121}{10.179.192.121:8081}{ml.machine_memory=8182054912, rack=dev, xpack.installed=true, ml.max_open_jobs=20}
[2020-03-19T07:45:48,388][TRACE][o.e.d.PeerFinder ] [ausdlovpes01_1] not active
[2020-03-19T07:45:48,412][INFO ][o.e.c.r.a.AllocationService] [ausdlovpes01_1] updating number_of_replicas to [0] for indices [.kibana_task_manager, .kibana_2, .kibana_1, .tasks]
[2020-03-19T07:45:48,424][INFO ][o.e.c.s.MasterService ] [ausdlovpes01_1] elected-as-master ([1] nodes joined)[{ausdlovpes01_1}{2jifvTz5SeuUc1MljZua2g}{yYpuu9D7Q0q3QHprUnpPVQ}{10.179.192.121}{10.179.192.121:8081}{ml.machine_memory=8182054912, rack=dev, xpack.installed=true, ml.max_open_jobs=20} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 18, version: 112085, reason: master node changed {previous [], current [{ausdlovpes01_1}{2jifvTz5SeuUc1MljZua2g}{yYpuu9D7Q0q3QHprUnpPVQ}{10.179.192.121}{10.179.192.121:8081}{ml.machine_memory=8182054912, rack=dev, xpack.installed=true, ml.max_open_jobs=20}]}
[2020-03-19T07:45:48,542][INFO ][o.e.c.s.ClusterApplierService] [ausdlovpes01_1] master node changed {previous [], current [{ausdlovpes01_1}{2jifvTz5SeuUc1MljZua2g}{yYpuu9D7Q0q3QHprUnpPVQ}{10.179.192.121}{10.179.192.121:8081}{ml.machine_memory=8182054912, rack=dev, xpack.installed=true, ml.max_open_jobs=20}]}, term: 18, version: 112085, reason: Publication{term=18, version=112085}
[2020-03-19T07:45:48,591][INFO ][o.e.h.AbstractHttpServerTransport] [ausdlovpes01_1] publish_address {10.179.192.121:8080}, bound_addresses {10.179.192.121:8080}
[2020-03-19T07:45:48,591][INFO ][o.e.n.Node ] [ausdlovpes01_1] started
[2020-03-19T07:45:49,271][TRACE][o.e.d.PeerFinder ] [ausdlovpes01_1] not active
[2020-03-19T07:46:15,860][INFO ][o.e.c.r.a.AllocationService] [ausdlovpes01_1] updating number_of_replicas to [1] for indices [.kibana_task_manager, .kibana_2, .kibana_1, .tasks]
[2020-03-19T07:46:15,862][INFO ][o.e.c.s.MasterService ] [ausdlovpes01_1] node-join[{ausilovpes01_1}{J8s6PJ27SCa5ymJsA41Vzg}{gkjxsliJRKiG6RDswo380A}{10.179.200.12}{10.179.200.12:8081}{ml.machine_memory=8182046720, rack=dev_replica, ml.max_open_jobs=20, xpack.installed=true} join existing leader], term: 18, version: 112086, reason: added {{ausilovpes01_1}{J8s6PJ27SCa5ymJsA41Vzg}{gkjxsliJRKiG6RDswo380A}{10.179.200.12}{10.179.200.12:8081}{ml.machine_memory=8182046720, rack=dev_replica, ml.max_open_jobs=20, xpack.installed=true},}
[2020-03-19T07:46:16,029][INFO ][o.e.c.s.ClusterApplierService] [ausdlovpes01_1] added {{ausilovpes01_1}{J8s6PJ27SCa5ymJsA41Vzg}{gkjxsliJRKiG6RDswo380A}{10.179.200.12}{10.179.200.12:8081}{ml.machine_memory=8182046720, rack=dev_replica, ml.max_open_jobs=20, xpack.installed=true},}, term: 18, version: 112086, reason: Publication{term=18, version=112086}
[2020-03-19T07:46:16,114][INFO ][o.e.c.r.a.DiskThresholdMonitor] [ausdlovpes01_1] low disk watermark [85%] exceeded on [2jifvTz5SeuUc1MljZua2g][ausdlovpes01_1][/u01/es/data/es_01/nodes/0] free: 4.4gb[11.3%], replicas will not be assigned to this node
[2020-03-19T07:46:16,115][INFO ][o.e.c.r.a.DiskThresholdMonitor] [ausdlovpes01_1] low disk watermark [85%] exceeded on [J8s6PJ27SCa5ymJsA41Vzg][ausilovpes01_1][/u01/es/data/es_01/nodes/0] free: 5.8gb[14.9%], replicas will not be assigned to this node
[2020-03-19T07:46:16,402][INFO ][o.e.l.LicenseService ] [ausdlovpes01_1] license [138c5a33-124d-4cd9-8dfd-c1e41e814366] mode [basic] - valid
[2020-03-19T07:46:16,415][INFO ][o.e.g.GatewayService ] [ausdlovpes01_1] recovered [5] indices into cluster_state
[2020-03-19T07:46:18,030][INFO ][o.e.c.r.a.AllocationService] [ausdlovpes01_1] updating number_of_replicas to [0] for indices [.kibana_task_manager, .kibana_2, .kibana_1, .tasks]
[2020-03-19T07:46:18,031][INFO ][o.e.c.s.MasterService ] [ausdlovpes01_1] node-left[{ausilovpes01_1}{J8s6PJ27SCa5ymJsA41Vzg}{gkjxsliJRKiG6RDswo380A}{10.179.200.12}{10.179.200.12:8081}{ml.machine_memory=8182046720, rack=dev_replica, ml.max_open_jobs=20, xpack.installed=true} followers check retry count exceeded], term: 18, version: 112091, reason: removed {{ausilovpes01_1}{J8s6PJ27SCa5ymJsA41Vzg}{gkjxsliJRKiG6RDswo380A}{10.179.200.12}{10.179.200.12:8081}{ml.machine_memory=8182046720, rack=dev_replica, ml.max_open_jobs=20, xpack.installed=true},}
[2020-03-19T07:46:18,103][INFO ][o.e.c.s.ClusterApplierService] [ausdlovpes01_1] removed {{ausilovpes01_1}{J8s6PJ27SCa5ymJsA41Vzg}{gkjxsliJRKiG6RDswo380A}{10.179.200.12}{10.179.200.12:8081}{ml.machine_memory=8182046720, rack=dev_replica, ml.max_open_jobs=20, xpack.installed=true},}, term: 18, version: 112091, reason: Publication{term=18, version=112091}
[2020-03-19T07:46:18,187][INFO ][o.e.c.r.a.AllocationService] [ausdlovpes01_1] updating number_of_replicas to [1] for indices [.kibana_task_manager, .kibana_2, .kibana_1, .tasks]
(and this node-join / node-left cycle keeps repeating...)
The disk on node ausdlovpes01_1 is 89% used, leaving about 4.5 GB free (not sure if this could have caused any exception).
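For reference, this is how I check the per-node disk usage that the DiskThresholdMonitor is reporting (a short sketch against the same HTTP endpoint, assuming no authentication):

# disk_check.py - show per-node disk usage as Elasticsearch sees it, to
# confirm whether the 85% low watermark is exceeded on both nodes
# (assumes HTTP on 10.179.192.121:8080 and security disabled).
import urllib.request

ALLOCATION_URL = ("http://10.179.192.121:8080/_cat/allocation"
                  "?v&h=node,disk.percent,disk.avail,disk.total")

with urllib.request.urlopen(ALLOCATION_URL, timeout=10) as resp:
    print(resp.read().decode())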