Two master nodes lose connection to the third master node after its IP changes

We are running 3 master nodes in a Kubernetes environment (as a StatefulSet).
When one of the master nodes dies and comes back with a new IP, the other two master nodes are not able to find it.
Any insights on what we can do better?
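For context, discovery follows the usual StatefulSet pattern: each master lists the pods' stable DNS names (via the headless service) as seed hosts. A rough sketch of that kind of elasticsearch.yml, where the service and host names are illustrative rather than our exact config:

    # elasticsearch.yml (illustrative sketch only; names are placeholders)
    cluster.name: elasticsearch-logging-us-west1
    node.name: ${HOSTNAME}
    node.roles: [ master ]
    network.host: 0.0.0.0
    # Seed hosts point at the headless-service DNS names of the StatefulSet pods,
    # so they re-resolve to the current pod IP after a restart.
    discovery.seed_hosts:
      - elasticsearch-logging-master-us-west1-a-0.elasticsearch-logging-master-headless
      - elasticsearch-logging-master-us-west1-a-1.elasticsearch-logging-master-headless
      - elasticsearch-logging-master-us-west1-a-2.elasticsearch-logging-master-headless
    # Only used for the very first bootstrap of the cluster.
    cluster.initial_master_nodes:
      - elasticsearch-logging-master-us-west1-a-0
      - elasticsearch-logging-master-us-west1-a-1
      - elasticsearch-logging-master-us-west1-a-2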

{"timestamp":"2022-02-26T02:40:55,852Z","level":"WARN","component":"o.e.c.NodeConnectionsService","cluster.name":"elasticsearch-logging-us-west1","node.name":"elasticsearch-logging-master-us-west1-a-0","message":"failed to connect to {elasticsearch-logging-master-us-west1-a-1}{4msBo_zpQD27irf5y4WRJw}{9i-AO_JYThSusw9Cgn3leg}{10.144.68.18}{10.144.68.18:9300}{m}{region=us-west1, zone=us-west1-a, storageclass=pd-standard} (tried [13[] times)", "stacktrace": ["org.elasticsearch.transport.ConnectTransportException: [elasticsearch-logging-master-us-west1-a-1[][10.144.68.18:9300[] connect_timeout[30s[]","at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:984) ~[elasticsearch-7.10.2.jar:7.10.2[]","at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:684) ~[elasticsearch-7.10.2.jar:7.10.2[]","at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]","at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]","at java.lang.Thread.run(Thread.java:832) [?:?]"]}
org.elasticsearch.transport.ConnectTransportException: [elasticsearch-logging-master-us-west1-a-1[][10.144.68.18:9300[] connect_timeout[30s[]
    at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:984) ~[elasticsearch-7.10.2.jar:7.10.2[]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:684) ~[elasticsearch-7.10.2.jar:7.10.2[]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
    at java.lang.Thread.run(Thread.java:832) [?:?]

Note that [elasticsearch-logging-master-us-west1-a-1][10.144.68.18:9300] connect_timeout[30s] still refers to the old IP (10.144.68.18).

Logs on the a-1 node show:


{"timestamp":"2022-02-26T03:02:26,701Z","level":"WARN","component":"o.e.c.c.ClusterFormationFailureHelper","cluster.name":"elasticsearch-logging-us-west1","node.name":"elasticsearch-logging-master-us-west1-a-1","message":"master not discovered or elected yet, an election requires at least 2 nodes with ids from [4msBo_zpQD27irf5y4WRJw, a2Zi_wdITIafm5meTnWxtA, VN24LF_rQb-krhtlB8gTnw[], have discovered [{elasticsearch-logging-master-us-west1-a-1}{4msBo_zpQD27irf5y4WRJw}{opBjbURuQvKGl0fGwLDBpQ}{10.144.1.58}{10.144.1.58:9300}{m}{zone=us-west1-a, region=us-west1, storageclass=pd-standard}, {elasticsearch-logging-master-us-west1-a-0}{a2Zi_wdITIafm5meTnWxtA}{BaMEs3LIS-KWxjqGJ3FTZA}{10.144.80.13}{10.144.80.13:9300}{m}{region=us-west1, zone=us-west1-a, storageclass=pd-standard}, {elasticsearch-logging-master-us-west1-a-2}{VN24LF_rQb-krhtlB8gTnw}{dAJmaR7gS5-eGR4YsCDiuA}{10.144.80.23}{10.144.80.23:9300}{m}{region=us-west1, zone=us-west1-a, storageclass=pd-standard}] which is a quorum; discovery will continue using [10.144.80.13:9300, 10.144.80.23:9300[] from hosts providers and [{elasticsearch-logging-master-us-west1-a-1}{4msBo_zpQD27irf5y4WRJw}{opBjbURuQvKGl0fGwLDBpQ}{10.144.1.58}{10.144.1.58:9300}{m}{zone=us-west1-a, region=us-west1, storageclass=pd-standard}] from last-known cluster state; node term 337, last-accepted version 41070276 in term 336"}

The two log messages you shared don't show the full story: they confirm that elasticsearch-logging-master-us-west1-a-1 can indeed find elasticsearch-logging-master-us-west1-a-0 and elasticsearch-logging-master-us-west1-a-2, so there's something else wrong. Would you share complete logs from this node, covering at least 10 minutes?
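For reference, the ids in that "an election requires at least 2 nodes with ids from [...]" message are the last-committed voting configuration stored in the cluster metadata. If there is still an elected master somewhere you can see it with something like this (the filter_path just trims the output):

    GET _cluster/state/metadata?filter_path=metadata.cluster_coordination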

Also 7.10 is pretty old, almost at EOL, and there have been improvements to the log messages in this situation in more recent versions. Would you upgrade to pick up these improvements?

Hmm, that's strange in a couple of ways. Firstly, it should have given up on this node long before trying 13 times, but also there's an extra [ in this message that Elasticsearch wouldn't have put there. Have you modified these logs in any way? Could you share the output of GET /?
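To be clear, GET / is just the root endpoint; the interesting parts here are the version and build details. The output looks roughly like this, trimmed to the relevant fields (exact values will differ):

    GET /

    {
      "name" : "elasticsearch-logging-master-us-west1-a-0",
      "cluster_name" : "elasticsearch-logging-us-west1",
      "version" : {
        "number" : "7.10.2",
        "build_flavor" : "...",
        "build_type" : "...",
        "build_hash" : "..."
      },
      "tagline" : "You Know, for Search"
    }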

Okay - I have the full logs here:
master0: master-0.log · GitHub
master1: master1 · GitHub
master2: master2.log · GitHub

I'm trying to analyze these as well, but in the meantime I thought it might be helpful to send them as is. Please let me know if you see any sensitive info here that I should redact.

This looks odd - publication failed

{"timestamp":"2022-02-25T18:54:50,282Z","level":"ERROR","component":"o.e.c.c.Coordinator","cluster.name":"elasticsearch-logging-us-west1","node.name":"elasticsearch-logging-master-us-west1-a-1","message":"unexpected failure during [node-left]", "stacktrace": ["org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: publication failed","at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$4.onFailure(Coordinator.java:1467) ~[elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:88) ~[elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:39) ~[elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:224) ~[elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:106) ~[elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:68) ~[elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication.onCompletion(Coordinator.java:1390) ~[elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.cluster.coordination.Publication.onPossibleCompletion(Publication.java:125) ~[elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.cluster.coordination.Publication.onPossibleCommitFailure(Publication.java:173) ~[elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.cluster.coordination.Publication.start(Publication.java:72) ~[elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.cluster.coordination.Coordinator.publish(Coordinator.java:1115) ~[elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:268) [elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:250) [elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73) [elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) [elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:684) [elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.10.2.jar:7.10.2]","at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]","at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]","at java.lang.Thread.run(Thread.java:832) [?:?]","Caused by: org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: non-failed nodes do not form a quorum","at org.elasticsearch.cluster.coordination.Publication.onPossibleCommitFailure(Publication.java:171) 
~[elasticsearch-7.10.2.jar:7.10.2]","... 14 more"]}
org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: publication failed
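For what it's worth, while a master is still elected, a quick way to see which nodes it currently has in the cluster and at which addresses would be something like:

    GET _cat/nodes?v&h=name,ip,node.role,master

That obviously only answers while there is an elected master to serve it.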

From the logs it appears you are using OpenDistro security, which is a third-party plugin and is not supported here. It is involved in the communication between nodes, so it cannot be ruled out as having an impact. I would recommend you uninstall the plugin and check whether the issue persists, or contact the OpenDistro community.
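One way to confirm exactly what is installed on each node is the cat plugins API, for example:

    GET _cat/plugins?v&h=name,component,version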


OpenDistro is an AWS-run product and differs from the original Elasticsearch and Kibana products that Elastic builds and maintains. You may need to contact them directly for further assistance.

(This is an automated response from your friendly Elastic bot. Please report this post if you have any suggestions or concerns :elasticheart: )

As Christian and the bot say: OpenDistro contains some nonstandard plugins that modify Elasticsearch in ways that relate to the problems you report. If you can reproduce this with a stock Elasticsearch build then we can dig deeper, but until you do that we'll not be able to help further.
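For that reproduction, one option is to point the same StatefulSet at the official image for your version and leave the OpenDistro plugins out. A fragment of the container spec, as a sketch only:

    # StatefulSet container spec fragment (sketch only)
    containers:
      - name: elasticsearch
        image: docker.elastic.co/elasticsearch/elasticsearch:7.10.2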


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.