We are running 3 master nodes in a Kubernetes environment (as a StatefulSet).
When one of the master nodes dies and comes back with a new IP, the other two master nodes are unable to find it.
Any insights on what we can do better?
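Our suspicion is that peer discovery is still keyed to pod IPs rather than stable DNS names. For context, this is a minimal sketch of what we think the discovery settings should look like in this setup, with seed hosts pointing at the pods' per-pod DNS names through a headless Service (the Service name elasticsearch-logging-master-headless and the logging namespace below are placeholders, not necessarily our exact manifest values):

# elasticsearch.yml (sketch; names are assumptions, not our exact manifest)
cluster.name: elasticsearch-logging-us-west1
network.host: 0.0.0.0
# Seed hosts by stable StatefulSet pod DNS names instead of IPs, so a restarted pod is found at its new address
discovery.seed_hosts:
  - elasticsearch-logging-master-us-west1-a-0.elasticsearch-logging-master-headless.logging.svc.cluster.local
  - elasticsearch-logging-master-us-west1-a-1.elasticsearch-logging-master-headless.logging.svc.cluster.local
  - elasticsearch-logging-master-us-west1-a-2.elasticsearch-logging-master-headless.logging.svc.cluster.local
# Only consulted when bootstrapping a brand-new cluster
cluster.initial_master_nodes:
  - elasticsearch-logging-master-us-west1-a-0
  - elasticsearch-logging-master-us-west1-a-1
  - elasticsearch-logging-master-us-west1-a-2

Is pointing discovery.seed_hosts at DNS names like this (instead of the pod IPs) the right approach, or are we missing something else?

Here is what we see in the logs on the a-0 node after a-1 restarted with a new IP: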
{"timestamp":"2022-02-26T02:40:55,852Z","level":"WARN","component":"o.e.c.NodeConnectionsService","cluster.name":"elasticsearch-logging-us-west1","node.name":"elasticsearch-logging-master-us-west1-a-0","message":"failed to connect to {elasticsearch-logging-master-us-west1-a-1}{4msBo_zpQD27irf5y4WRJw}{9i-AO_JYThSusw9Cgn3leg}{10.144.68.18}{10.144.68.18:9300}{m}{region=us-west1, zone=us-west1-a, storageclass=pd-standard} (tried [13[] times)", "stacktrace": ["org.elasticsearch.transport.ConnectTransportException: [elasticsearch-logging-master-us-west1-a-1[][10.144.68.18:9300[] connect_timeout[30s[]","at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:984) ~[elasticsearch-7.10.2.jar:7.10.2[]","at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:684) ~[elasticsearch-7.10.2.jar:7.10.2[]","at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]","at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]","at java.lang.Thread.run(Thread.java:832) [?:?]"]}
org.elasticsearch.transport.ConnectTransportException: [elasticsearch-logging-master-us-west1-a-1][10.144.68.18:9300] connect_timeout[30s]
at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:984) ~[elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:684) ~[elasticsearch-7.10.2.jar:7.10.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
at java.lang.Thread.run(Thread.java:832) [?:?]
Note that [elasticsearch-logging-master-us-west1-a-1][10.144.68.18:9300] connect_timeout[30s] still refers to the old IP of the a-1 pod.
Logs on the a-1 node show:
{"timestamp":"2022-02-26T03:02:26,701Z","level":"WARN","component":"o.e.c.c.ClusterFormationFailureHelper","cluster.name":"elasticsearch-logging-us-west1","node.name":"elasticsearch-logging-master-us-west1-a-1","message":"master not discovered or elected yet, an election requires at least 2 nodes with ids from [4msBo_zpQD27irf5y4WRJw, a2Zi_wdITIafm5meTnWxtA, VN24LF_rQb-krhtlB8gTnw[], have discovered [{elasticsearch-logging-master-us-west1-a-1}{4msBo_zpQD27irf5y4WRJw}{opBjbURuQvKGl0fGwLDBpQ}{10.144.1.58}{10.144.1.58:9300}{m}{zone=us-west1-a, region=us-west1, storageclass=pd-standard}, {elasticsearch-logging-master-us-west1-a-0}{a2Zi_wdITIafm5meTnWxtA}{BaMEs3LIS-KWxjqGJ3FTZA}{10.144.80.13}{10.144.80.13:9300}{m}{region=us-west1, zone=us-west1-a, storageclass=pd-standard}, {elasticsearch-logging-master-us-west1-a-2}{VN24LF_rQb-krhtlB8gTnw}{dAJmaR7gS5-eGR4YsCDiuA}{10.144.80.23}{10.144.80.23:9300}{m}{region=us-west1, zone=us-west1-a, storageclass=pd-standard}] which is a quorum; discovery will continue using [10.144.80.13:9300, 10.144.80.23:9300[] from hosts providers and [{elasticsearch-logging-master-us-west1-a-1}{4msBo_zpQD27irf5y4WRJw}{opBjbURuQvKGl0fGwLDBpQ}{10.144.1.58}{10.144.1.58:9300}{m}{zone=us-west1-a, region=us-west1, storageclass=pd-standard}] from last-known cluster state; node term 337, last-accepted version 41070276 in term 336"}