I have a 1-client / 2-data / 3-master Elasticsearch 6.2.2 cluster spread across a 3-node vSphere environment, running on Kubernetes. The data and master nodes are StatefulSets, and the master StatefulSet has a Service sitting on it acting as the discovery service. The installation is successful and we can see the data in Kibana. However, if all 3 masters happen to go down together, they come back up with new IPs and discover each other, but the data and client nodes can't seem to discover the new masters; their logs show them still trying to hit the old IPs.
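For context, the Service sitting on the masters is along these lines (a rough sketch only; the name, labels, and headless setting are assumptions, not copied from my actual manifests):

apiVersion: v1
kind: Service
metadata:
  name: platform-elasticsearch-discovery    # hypothetical name, referenced via DISCOVERY_SERVICE below
spec:
  clusterIP: None                           # assumed headless, so DNS resolves to the master pod IPs
  selector:
    app: platform-elasticsearch-master      # assumed label on the master StatefulSet pods
  ports:
    - name: transport
      port: 9300                            # zen discovery / transport port seen in the logs
      targetPort: 9300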
Master log:
[2018-07-18T23:13:52,533][INFO ][o.e.b.BootstrapChecks ] [platform-elasticsearch-master-0] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2018-07-18T23:13:52,943][INFO ][o.e.m.j.JvmGcMonitorService] [platform-elasticsearch-master-0] [gc][1] overhead, spent [372ms] collecting in the last [1s]
[2018-07-18T23:13:55,890][INFO ][o.e.c.s.ClusterApplierService] [platform-elasticsearch-master-0] detected_master {platform-elasticsearch-master-1}{Or4qtPKjRyajMI-iC9mnwg}{Aw3yHwDhSOu3DiocgUParg}{172.24.0.99}{172.24.0.99:9300}, added {{platform-elasticsearch-client-78d74649fc-7976p}{TkuEXJttR1C93sOuYheVTg}{6vsOtJxFTC6AdaPzNrJNbQ}{172.24.2.212}{172.24.2.212:9300},{platform-elasticsearch-master-2}{bpIqzNyKS86Sk8vPRHSy7A}{RD3XUV1TTKaD1cjOSuxf7g}{172.24.1.104}{172.24.1.104:9300},{platform-elasticsearch-data-0}{cpsl1MgFT0-c15IqgVyX7w}{hVWwSzwuQ6arlVVDEO_JkQ}{172.24.2.214}{172.24.2.214:9300},{platform-elasticsearch-data-1}{CKsi0wfrQtC421UuhS4EUQ}{tn-w2myFQkmN1nV1kaHB0g}{172.24.1.105}{172.24.1.105:9300},{platform-elasticsearch-master-1}{Or4qtPKjRyajMI-iC9mnwg}{Aw3yHwDhSOu3DiocgUParg}{172.24.0.99}{172.24.0.99:9300},}, reason: apply cluster state (from master [master {platform-elasticsearch-master-1}{Or4qtPKjRyajMI-iC9mnwg}{Aw3yHwDhSOu3DiocgUParg}{172.24.0.99}{172.24.0.99:9300} committed version [2]])
[2018-07-18T23:13:56,205][INFO ][o.e.n.Node ] [platform-elasticsearch-master-0] started
Client logs:
[2018-07-18T22:49:34,640][WARN ][o.e.d.z.ZenDiscovery ] [platform-elasticsearch-client-5d5cff75d9-k29x7] not enough master nodes discovered during pinging (found [[]], but needed [2]), pinging again
[2018-07-18T22:49:35,635][WARN ][o.e.c.NodeConnectionsService] [platform-elasticsearch-client-5d5cff75d9-k29x7] failed to connect to node {platform-elasticsearch-master-1}{IPe2h_kETombpzD2ZSnetA}{ACKXkGpVQ8CMv5td3tAHxQ}{172.24.1.101}{172.24.1.101:9300} (tried [7] times)
org.elasticsearch.transport.ConnectTransportException: [platform-elasticsearch-master-1][172.24.1.101:9300] connect_timeout[30s]
    at org.elasticsearch.transport.TcpChannel.awaitConnected(TcpChannel.java:163) ~[elasticsearch-6.2.2.jar:6.2.2]
    at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:616) ~[elasticsearch-6.2.2.jar:6.2.2]
    at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:513) ~[elasticsearch-6.2.2.jar:6.2.2]
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:331) ~[elasticsearch-6.2.2.jar:6.2.2]
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:318) ~[elasticsearch-6.2.2.jar:6.2.2]
    at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:154) [elasticsearch-6.2.2.jar:6.2.2]
    at org.elasticsearch.cluster.NodeConnectionsService$ConnectionChecker.doRun(NodeConnectionsService.java:183) [elasticsearch-6.2.2.jar:6.2.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) [elasticsearch-6.2.2.jar:6.2.2]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.2.2.jar:6.2.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
master/client config:
cluster.name: ${CLUSTER_NAME:true}
node.name: ${NODE_NAME:}
node.master: ${NODE_MASTER:}
node.data: ${NODE_DATA:}
node.ingest: ${NODE_INGEST:}
network.host: ${NETWORK_HOST:0.0.0.0}
path:
  data: /usr/share/elasticsearch/data
  logs: /usr/share/elasticsearch/logs
bootstrap:
  memory_lock: ${MEMORY_LOCK:false}
http:
  enabled: ${HTTP_ENABLE:false}
discovery:
  zen:
    ping.unicast.hosts: ${DISCOVERY_SERVICE:}
    minimum_master_nodes: ${MINIMUM_NUMBER_OF_MASTERS:1}
    commit_timeout: 60s
    publish_timeout: 60s
gateway.expected_nodes: 5
gateway.expected_master_nodes: 3
gateway.expected_data_nodes: 2
gateway.recover_after_nodes: 2
gateway.recover_after_master_nodes: 2
gateway.recover_after_data_nodes: 1
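The ${...} placeholders are filled in from the pod environment. On the client/data pod specs that wiring looks roughly like the sketch below (variable values are assumptions for illustration, except the master quorum of 2, which matches the "needed [2]" in the client log; the Service name is hypothetical):

env:
  - name: CLUSTER_NAME
    value: platform-elasticsearch             # assumed cluster name
  - name: DISCOVERY_SERVICE
    value: platform-elasticsearch-discovery   # the Service in front of the masters (hypothetical name)
  - name: MINIMUM_NUMBER_OF_MASTERS
    value: "2"                                 # quorum for 3 master-eligible nodes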
NOTE: Restarting the client fixes the issue. But why is this happening?
Please help!