Two master nodes lose connection to the third master node after its IP changes

We are running 3 master nodes in a Kubernetes environment (as a StatefulSet).
When one of the master nodes dies and comes back with a new IP, the other two master nodes are not able to find it.
Any insights on what we can do better?
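For context, discovery follows the usual StatefulSet pattern: each master lists the pods' stable DNS names (via the headless service) as seed hosts. A rough sketch of that kind of elasticsearch.yml, where the service and host names are illustrative rather than our exact config:

    # elasticsearch.yml (illustrative sketch only; names are placeholders)
    cluster.name: elasticsearch-logging-us-west1
    node.name: ${HOSTNAME}
    node.roles: [ master ]
    network.host: 0.0.0.0
    # Seed hosts point at the headless-service DNS names of the StatefulSet pods,
    # so they re-resolve to the current pod IP after a restart.
    discovery.seed_hosts:
      - elasticsearch-logging-master-us-west1-a-0.elasticsearch-logging-master-headless
      - elasticsearch-logging-master-us-west1-a-1.elasticsearch-logging-master-headless
      - elasticsearch-logging-master-us-west1-a-2.elasticsearch-logging-master-headless
    # Only used for the very first bootstrap of the cluster.
    cluster.initial_master_nodes:
      - elasticsearch-logging-master-us-west1-a-0
      - elasticsearch-logging-master-us-west1-a-1
      - elasticsearch-logging-master-us-west1-a-2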

{"timestamp":"2022-02-26T02:40:55,852Z","level":"WARN","component":"o.e.c.NodeConnectionsService","cluster.name":"elasticsearch-logging-us-west1","node.name":"elasticsearch-logging-master-us-west1-a-0","message":"failed to connect to {elasticsearch-logging-master-us-west1-a-1}{4msBo_zpQD27irf5y4WRJw}{9i-AO_JYThSusw9Cgn3leg}{10.144.68.18}{10.144.68.18:9300}{m}{region=us-west1, zone=us-west1-a, storageclass=pd-standard} (tried [13[] times)", "stacktrace": ["org.elasticsearch.transport.ConnectTransportException: [elasticsearch-logging-master-us-west1-a-1[][10.144.68.18:9300[] connect_timeout[30s[]","at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:984) ~[elasticsearch-7.10.2.jar:7.10.2[]","at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:684) ~[elasticsearch-7.10.2.jar:7.10.2[]","at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]","at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]","at java.lang.Thread.run(Thread.java:832) [?:?]"]}
org.elasticsearch.transport.ConnectTransportException: [elasticsearch-logging-master-us-west1-a-1[][10.144.68.18:9300[] connect_timeout[30s[]
    at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:984) ~[elasticsearch-7.10.2.jar:7.10.2[]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:684) ~[elasticsearch-7.10.2.jar:7.10.2[]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
    at java.lang.Thread.run(Thread.java:832) [?:?]

Note that [elasticsearch-logging-master-us-west1-a-1][10.144.68.18:9300] connect_timeout[30s] still refers to the old IP (10.144.68.18).

Logs on the a-1 node show:


{"timestamp":"2022-02-26T03:02:26,701Z","level":"WARN","component":"o.e.c.c.ClusterFormationFailureHelper","cluster.name":"elasticsearch-logging-us-west1","node.name":"elasticsearch-logging-master-us-west1-a-1","message":"master not discovered or elected yet, an election requires at least 2 nodes with ids from [4msBo_zpQD27irf5y4WRJw, a2Zi_wdITIafm5meTnWxtA, VN24LF_rQb-krhtlB8gTnw[], have discovered [{elasticsearch-logging-master-us-west1-a-1}{4msBo_zpQD27irf5y4WRJw}{opBjbURuQvKGl0fGwLDBpQ}{10.144.1.58}{10.144.1.58:9300}{m}{zone=us-west1-a, region=us-west1, storageclass=pd-standard}, {elasticsearch-logging-master-us-west1-a-0}{a2Zi_wdITIafm5meTnWxtA}{BaMEs3LIS-KWxjqGJ3FTZA}{10.144.80.13}{10.144.80.13:9300}{m}{region=us-west1, zone=us-west1-a, storageclass=pd-standard}, {elasticsearch-logging-master-us-west1-a-2}{VN24LF_rQb-krhtlB8gTnw}{dAJmaR7gS5-eGR4YsCDiuA}{10.144.80.23}{10.144.80.23:9300}{m}{region=us-west1, zone=us-west1-a, storageclass=pd-standard}] which is a quorum; discovery will continue using [10.144.80.13:9300, 10.144.80.23:9300[] from hosts providers and [{elasticsearch-logging-master-us-west1-a-1}{4msBo_zpQD27irf5y4WRJw}{opBjbURuQvKGl0fGwLDBpQ}{10.144.1.58}{10.144.1.58:9300}{m}{zone=us-west1-a, region=us-west1, storageclass=pd-standard}] from last-known cluster state; node term 337, last-accepted version 41070276 in term 336"}

The two log messages you shared don't show the full story: they confirm that elasticsearch-logging-master-us-west1-a-1 can indeed find elasticsearch-logging-master-us-west1-a-0 and elasticsearch-logging-master-us-west1-a-2, so there's something else wrong. Would you share complete logs from this node, covering at least 10 minutes?
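For reference, the ids in that "an election requires at least 2 nodes with ids from [...]" message are the last-committed voting configuration stored in the cluster metadata. If there is still an elected master somewhere you can see it with something like this (the filter_path just trims the output):

    GET _cluster/state/metadata?filter_path=metadata.cluster_coordination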

Also 7.10 is pretty old, almost at EOL, and there have been improvements to the log messages in this situation in more recent versions. Would you upgrade to pick up these improvements?

Hmm, that's strange in a couple of ways. Firstly, it should have given up on this node long before trying 13 times, but also there's an extra [ in this message that Elasticsearch wouldn't have put there. Have you modified these logs in any way? Could you share the output of GET /?
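To be clear, GET / is just the root endpoint; the interesting parts here are the version and build details. The output looks roughly like this, trimmed to the relevant fields (exact values will differ):

    GET /

    {
      "name" : "elasticsearch-logging-master-us-west1-a-0",
      "cluster_name" : "elasticsearch-logging-us-west1",
      "version" : {
        "number" : "7.10.2",
        "build_flavor" : "...",
        "build_type" : "...",
        "build_hash" : "..."
      },
      "tagline" : "You Know, for Search"
    }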

Okay - I have the full logs here:
master0: master-0.log · GitHub
master1: master1 · GitHub
master2: master2.log · GitHub

I'm trying to analyze these as well, but in the meantime I thought it might be helpful to send them as is. Please let me know if you see any sensitive info here that I should redact.

This looks odd - publication failed

{"timestamp":"2022-02-25T18:54:50,282Z","level":"ERROR","component":"o.e.c.c.Coordinator","cluster.name":"elasticsearch-logging-us-west1","node.name":"elasticsearch-logging-master-us-west1-a-1","message":"unexpected failure during [node-left]", "stacktrace": ["org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: publication failed","at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$4.onFailure(Coordinator.java:1467) ~[elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:88) ~[elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:39) ~[elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:224) ~[elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:106) ~[elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:68) ~[elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication.onCompletion(Coordinator.java:1390) ~[elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.cluster.coordination.Publication.onPossibleCompletion(Publication.java:125) ~[elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.cluster.coordination.Publication.onPossibleCommitFailure(Publication.java:173) ~[elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.cluster.coordination.Publication.start(Publication.java:72) ~[elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.cluster.coordination.Coordinator.publish(Coordinator.java:1115) ~[elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:268) [elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:250) [elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73) [elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) [elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:684) [elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.10.2.jar:7.10.2]","at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.10.2.jar:7.10.2]","at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]","at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]","at java.lang.Thread.run(Thread.java:832) [?:?]","Caused by: org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: non-failed nodes do not form a quorum","at org.elasticsearch.cluster.coordination.Publication.onPossibleCommitFailure(Publication.java:171) 
~[elasticsearch-7.10.2.jar:7.10.2]","... 14 more"]}
org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: publication failed
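For what it's worth, while a master is still elected, a quick way to see which nodes it currently has in the cluster and at which addresses would be something like:

    GET _cat/nodes?v&h=name,ip,node.role,master

That obviously only answers while there is an elected master to serve it.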

From the logs it appears you are using OpenDistro security, which is a third-party plugin and is not supported here. It is involved in the communication between nodes, so it cannot be ruled out as having an impact. I would recommend you uninstall the plugin and check whether the issue persists, or contact the OpenDistro community.
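One way to confirm exactly what is installed on each node is the cat plugins API, for example:

    GET _cat/plugins?v&h=name,component,version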


OpenDistro is an AWS-run product and differs from the original Elasticsearch and Kibana products that Elastic builds and maintains. You may need to contact them directly for further assistance.

(This is an automated response from your friendly Elastic bot. Please report this post if you have any suggestions or concerns :elasticheart: )

As Christian and the bot say: OpenDistro contains some nonstandard plugins that modify Elasticsearch in ways that relate to the problems you report. If you can reproduce this with a stock Elasticsearch build then we can dig deeper, but until you do that we'll not be able to help further.
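For that reproduction, one option is to point the same StatefulSet at the official image for your version and leave the OpenDistro plugins out. A fragment of the container spec, as a sketch only:

    # StatefulSet container spec fragment (sketch only)
    containers:
      - name: elasticsearch
        image: docker.elastic.co/elasticsearch/elasticsearch:7.10.2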


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.