[ES6 upgrade to ES7] Two-node cluster works normally, but when the master node fails, the slave node keeps trying to reconnect instead of electing itself as the new master

1. Servers: x86_64 GNU/Linux, two nodes (log1 and log2), communicating over an internal network
2. Software version: Elasticsearch 7.3.1, OSS distribution without the x-pack features
3. Roles: each node is both a master-eligible node and a data node; log1's primary shards live on log1 with their replicas on log2, and vice versa
4. elasticsearch.yml configuration (identical on both nodes except for the names):
cluster.name: log-search-cluster
node.name: node_20.23.72.14
path.data: /var/share/oss/xxxxx/LogConfigService/es
path.logs: /opt/oss/log/xxxxx/LogConfigService/es
bootstrap.memory_lock: true
network.host: 20.23.72.14
transport.tcp.port: 27336
discovery.zen.ping_timeout: 60s
client.transport.ping_timeout: 60s
cluster.join.timeout: 30s
discovery.seed_hosts: [20.23.72.15, 20.23.72.14]
cluster.initial_master_nodes: [node_20.23.72.15, node_20.23.72.14]
5. Context:
Background: upgraded from ES 6.3.1 to ES 7.3.1 via a full cluster restart (not a rolling upgrade).
Current behavior:
1. Two-node cluster mode: the two-node cluster works normally. Each node holds 48 shards (1 primary + 1 replica per shard, with the replica placed on the other node); data is consistent and there are no anomalies.
2. When one of the two nodes stops:
A. If the slave node goes down, the master node keeps running and the single-node cluster works normally: searching and writing both succeed.
B. If the master node goes down, the slave node does not work: it keeps trying to reconnect to the old master, blocking all functionality. (What I want: the slave promotes itself to master and works normally as a single node; once the original master recovers, it rejoins the cluster as a slave.)
How this was handled in ES6:
By manually adjusting discovery.zen.minimum_master_nodes, the two nodes could work either as a cluster or as standalone nodes (master and slave alike). With both nodes in the cluster the value was 2, so the vote count was reached and a master was elected; with a single node the value was changed to 1, so the node reached quorum by itself and became master. Primary shards kept running, replica shards stayed unassigned, but functionality was intact. That is exactly the behavior I want!
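For reference, a rough sketch of how we toggled that setting at runtime in ES 6.x through the standard cluster-settings API (the setting was dynamic, so no restart was needed; the host and the 9200 HTTP port are taken from the config and logs above):

# Both nodes up: require a quorum of 2 to elect a master
curl -X PUT "http://20.23.72.14:9200/_cluster/settings" -H 'Content-Type: application/json' -d '{"persistent": {"discovery.zen.minimum_master_nodes": 2}}'

# Only one node left: let the survivor elect itself
curl -X PUT "http://20.23.72.14:9200/_cluster/settings" -H 'Content-Type: application/json' -d '{"persistent": {"discovery.zen.minimum_master_nodes": 1}}'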
How ES7 handles it:
The old zen discovery module was removed and replaced by a new cluster coordination subsystem. The discovery.zen.minimum_master_nodes setting is now ignored; ES manages the voting configuration (the set of master-eligible nodes that can form a quorum) by itself. With two nodes it keeps the voting configuration at an odd size, so only one of the two nodes holds a vote, which is why the slave's log says an election requires the old master's node id. Once the current master goes down there are not enough votes to elect a new master, normal functionality is blocked, and the surviving node just keeps trying to reconnect to the old master!
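As far as I can tell, the closest thing ES7 offers for this is the voting-config-exclusions API, which removes a master-eligible node from the voting configuration so that the remaining node can win an election alone. A sketch against 7.3, using the node names from the config above; note that the request must be processed while the cluster still has a master, so it only covers planned shutdowns, not a sudden master crash like mine:

# Before deliberately stopping log1, exclude it from the voting configuration
curl -X POST "http://20.23.72.15:9200/_cluster/voting_config_exclusions/node_20.23.72.14"

# Once log1 is back in the cluster, clear the exclusions so it can vote again
curl -X DELETE "http://20.23.72.15:9200/_cluster/voting_config_exclusions?wait_for_removal=false"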
6. Problem description:
Before the upgrade, the two nodes could work either as a cluster or as standalone nodes (master or slave alike), and the working mode changed dynamically as nodes were removed or added.
After the upgrade, the two-node cluster works fine and the master node can work standalone, but the slave node keeps trying to reconnect to the master and functionality is blocked. (Restarting the slave does let it run standalone, but I do not want to restart it.) What I want: after a few failed connection attempts, the slave elects itself as master and works as a standalone node, so functionality survives this failure scenario.
7. Contact email:
15195895896@163.com
Any help would be greatly appreciated!

Screenshots could not be uploaded; the log output is appended below the post instead.

Logs while the cluster is working normally, master node (log1):

[2019-10-30T10:21:36,500][INFO ][o.e.c.c.Coordinator ] [node_20.23.72.14] cluster UUID [O67XhXA9TRG78Qe5yP2Tag]
[2019-10-30T10:21:36,679][INFO ][o.e.c.s.MasterService ] [node_20.23.72.14] elected-as-master ([1] nodes joined)[{node_20.23.72.14}{pyKYCHjPQLGcM3EwI_Dc8A}{Tpu2X9YXR4a5Ockkaby0oQ}{20.23.72.14}{20.23.72.14:27336}{dim} elect leader, BECOME_MASTER_TASK, FINISH_ELECTION], term: 10, version: 472, reason: master node changed {previous , current [{node_20.23.72.14}{pyKYCHjPQLGcM3EwI_Dc8A}{Tpu2X9YXR4a5Ockkaby0oQ}{20.23.72.14}{20.23.72.14:27336}{dim}]}
[2019-10-30T10:21:36,960][INFO ][o.e.c.s.ClusterApplierService] [node_20.23.72.14] master node changed {previous , current [{node_20.23.72.14}{pyKYCHjPQLGcM3EwI_Dc8A}{Tpu2X9YXR4a5Ockkaby0oQ}{20.23.72.14}{20.23.72.14:27336}{dim}]}, term: 10, version: 472, reason: Publication{term=10, version=472}
[2019-10-30T10:21:37,000][INFO ][o.e.h.AbstractHttpServerTransport] [node_20.23.72.14] publish_address {20.23.72.14:9200}, bound_addresses {20.23.72.14:9200}
[2019-10-30T10:21:37,001][INFO ][o.e.n.Node ] [node_20.23.72.14] started
[2019-10-30T10:21:37,505][INFO ][o.e.g.GatewayService ] [node_20.23.72.14] recovered [12] indices into cluster_state
[2019-10-30T10:21:37,507][INFO ][o.e.c.s.MasterService ] [node_20.23.72.14] node-join[{node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{xFtBz9RHQ1an5sF3ebLtbg}{20.23.72.15}{20.23.72.15:27336}{dim} join existing leader], term: 10, version: 474, reason: added {{node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{xFtBz9RHQ1an5sF3ebLtbg}{20.23.72.15}{20.23.72.15:27336}{dim},}
[2019-10-30T10:21:37,770][INFO ][o.e.c.s.ClusterApplierService] [node_20.23.72.14] added {{node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{xFtBz9RHQ1an5sF3ebLtbg}{20.23.72.15}{20.23.72.15:27336}{dim},}, term: 10, version: 474, reason: Publication{term=10, version=474}
[2019-10-30T10:21:39,930][INFO ][o.e.c.r.a.AllocationService] [node_20.23.72.14] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[20191029_logstorage_3d_cnglobal1_ies][2], [20191029_logstorage_3d_cnglobal1_ies][1], [20191029_logstorage_3d_cnglobal1_ies][0]] ...]).
[2019-10-30T10:21:45,924][INFO ][o.e.c.r.a.AllocationService] [node_20.23.72.14] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[20191030_logstorage_3d_cnglobal1_ies][3]] ...]).

Logs while the cluster is working normally, slave node (log2):
[2019-10-30T10:21:36,046][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node_20.23.72.15] master not discovered or elected yet, an election requires a node with id [pyKYCHjPQLGcM3EwI_Dc8A], have discovered [{node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{xFtBz9RHQ1an5sF3ebLtbg}{20.23.72.15}{20.23.72.15:27336}{dim}] which is not a quorum; discovery will continue using [20.23.72.14:27336] from hosts providers and [{node_20.23.72.14}{pyKYCHjPQLGcM3EwI_Dc8A}{ut-RDqM4R4SL9zxp1jPVsQ}{20.23.72.14}{20.23.72.14:27336}{dim}, {node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{xFtBz9RHQ1an5sF3ebLtbg}{20.23.72.15}{20.23.72.15:27336}{dim}] from last-known cluster state; node term 9, last-accepted version 471 in term 9
[2019-10-30T10:21:37,085][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [node_20.23.72.15] timed out while retrying [cluster:monitor/state] after failure (timeout [30s])
[2019-10-30T10:21:37,105][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [node_20.23.72.15] no known master node, scheduling a retry
[2019-10-30T10:21:37,649][INFO ][o.e.c.s.ClusterApplierService] [node_20.23.72.15] master node changed {previous , current [{node_20.23.72.14}{pyKYCHjPQLGcM3EwI_Dc8A}{Tpu2X9YXR4a5Ockkaby0oQ}{20.23.72.14}{20.23.72.14:27336}{dim}]}, removed {{node_20.23.72.14}{pyKYCHjPQLGcM3EwI_Dc8A}{ut-RDqM4R4SL9zxp1jPVsQ}{20.23.72.14}{20.23.72.14:27336}{dim},}, added {{node_20.23.72.14}{pyKYCHjPQLGcM3EwI_Dc8A}{Tpu2X9YXR4a5Ockkaby0oQ}{20.23.72.14}{20.23.72.14:27336}{dim},}, term: 10, version: 474, reason: ApplyCommitRequest{term=10, version=474, sourceNode={node_20.23.72.14}{pyKYCHjPQLGcM3EwI_Dc8A}{Tpu2X9YXR4a5Ockkaby0oQ}{20.23.72.14}{20.23.72.14:27336}{dim}}

Master node logs when the master node is shut down:
[2019-10-30T10:24:06,092][INFO ][o.e.n.Node ] [node_20.23.72.14] stopping ...
[2019-10-30T10:24:06,541][INFO ][o.e.n.Node ] [node_20.23.72.14] stopped
[2019-10-30T10:24:06,542][INFO ][o.e.n.Node ] [node_20.23.72.14] closing ...
[2019-10-30T10:24:06,552][INFO ][o.e.n.Node ] [node_20.23.72.14] closed

Slave node logs at this point; it keeps trying to reconnect to the master node, but I do not want it to stay blocked here:
[2019-10-30T10:24:06,130][INFO ][o.e.c.c.Coordinator ] [node_20.23.72.15] master node [{node_20.23.72.14}{pyKYCHjPQLGcM3EwI_Dc8A}{Tpu2X9YXR4a5Ockkaby0oQ}{20.23.72.14}{20.23.72.14:27336}{dim}] failed, restarting discovery
org.elasticsearch.transport.NodeDisconnectedException: [node_20.23.72.14][20.23.72.14:27336][disconnected] disconnected
[2019-10-30T10:24:06,133][INFO ][o.e.c.s.ClusterApplierService] [node_20.23.72.15] master node changed {previous [{node_20.23.72.14}{pyKYCHjPQLGcM3EwI_Dc8A}{Tpu2X9YXR4a5Ockkaby0oQ}{20.23.72.14}{20.23.72.14:27336}{dim}], current }, term: 10, version: 528, reason: becoming candidate: onLeaderFailure
[2019-10-30T10:24:06,140][WARN ][o.e.c.NodeConnectionsService] [node_20.23.72.15] failed to connect to {node_20.23.72.14}{pyKYCHjPQLGcM3EwI_Dc8A}{Tpu2X9YXR4a5Ockkaby0oQ}{20.23.72.14}{20.23.72.14:27336}{dim} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [node_20.23.72.14][20.23.72.14:27336] connect_exception
at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:957) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$3(ActionListener.java:161) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:42) ~[elasticsearch-core-7.3.1.jar:7.3.1]
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) ~[?:1.8.0_212]
at java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:778) ~[?:1.8.0_212]
at java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2140) ~[?:1.8.0_212]
at org.elasticsearch.common.concurrent.CompletableContext.addListener(CompletableContext.java:45) ~[elasticsearch-core-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.netty4.Netty4TcpChannel.addConnectListener(Netty4TcpChannel.java:121) ~[?:?]
at org.elasticsearch.transport.TcpTransport.initiateConnection(TcpTransport.java:299) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:266) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.ConnectionManager.internalOpenConnection(ConnectionManager.java:206) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.ConnectionManager.connectToNode(ConnectionManager.java:104) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:346) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:333) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.cluster.NodeConnectionsService$ConnectionTarget$1.doRun(NodeConnectionsService.java:304) [elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:758) [elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.3.1.jar:7.3.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 20.23.72.14/20.23.72.14:27336
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:670) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:582) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:536) ~[?:?]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) ~[?:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) ~[?:?]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
... 1 more
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:670) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:582) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:536) ~[?:?]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) ~[?:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) ~[?:?]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
... 1 more

Continuing from the previous reply, which got cut off:
[2019-10-30T10:24:06,999][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [node_20.23.72.15] no known master node, scheduling a retry
[2019-10-30T10:24:16,133][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node_20.23.72.15] master not discovered or elected yet, an election requires a node with id [pyKYCHjPQLGcM3EwI_Dc8A], have discovered [{node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{xFtBz9RHQ1an5sF3ebLtbg}{20.23.72.15}{20.23.72.15:27336}{dim}] which is not a quorum; discovery will continue using [20.23.72.14:27336] from hosts providers and [{node_20.23.72.14}{pyKYCHjPQLGcM3EwI_Dc8A}{Tpu2X9YXR4a5Ockkaby0oQ}{20.23.72.14}{20.23.72.14:27336}{dim}, {node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{xFtBz9RHQ1an5sF3ebLtbg}{20.23.72.15}{20.23.72.15:27336}{dim}] from last-known cluster state; node term 10, last-accepted version 528 in term 10
[2019-10-30T10:24:26,135][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node_20.23.72.15] master not discovered or elected yet, an election requires a node with id [pyKYCHjPQLGcM3EwI_Dc8A], have discovered [{node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{xFtBz9RHQ1an5sF3ebLtbg}{20.23.72.15}{20.23.72.15:27336}{dim}] which is not a quorum; discovery will continue using [20.23.72.14:27336] from hosts providers and [{node_20.23.72.14}{pyKYCHjPQLGcM3EwI_Dc8A}{Tpu2X9YXR4a5Ockkaby0oQ}{20.23.72.14}{20.23.72.14:27336}{dim}, {node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{xFtBz9RHQ1an5sF3ebLtbg}{20.23.72.15}{20.23.72.15:27336}{dim}] from last-known cluster state; node term 10, last-accepted version 528 in term 10
[2019-10-30T10:24:36,136][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node_20.23.72.15] master not discovered or elected yet, an election requires a node with id [pyKYCHjPQLGcM3EwI_Dc8A], have discovered [{node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{xFtBz9RHQ1an5sF3ebLtbg}{20.23.72.15}{20.23.72.15:27336}{dim}] which is not a quorum; discovery will continue using [20.23.72.14:27336] from hosts providers and [{node_20.23.72.14}{pyKYCHjPQLGcM3EwI_Dc8A}{Tpu2X9YXR4a5Ockkaby0oQ}{20.23.72.14}{20.23.72.14:27336}{dim}, {node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{xFtBz9RHQ1an5sF3ebLtbg}{20.23.72.15}{20.23.72.15:27336}{dim}] from last-known cluster state; node term 10, last-accepted version 528 in term 10
[2019-10-30T10:24:37,001][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [node_20.23.72.15] timed out while retrying [cluster:monitor/state] after failure (timeout [30s])
[2019-10-30T10:24:37,010][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [node_20.23.72.15] no known master node, scheduling a retry
[2019-10-30T10:24:46,138][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node_20.23.72.15] master not discovered or elected yet, an election requires a node with id [pyKYCHjPQLGcM3EwI_Dc8A], have discovered [{node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{xFtBz9RHQ1an5sF3ebLtbg}{20.23.72.15}{20.23.72.15:27336}{dim}] which is not a quorum; discovery will continue using [20.23.72.14:27336] from hosts providers and [{node_20.23.72.14}{pyKYCHjPQLGcM3EwI_Dc8A}{Tpu2X9YXR4a5Ockkaby0oQ}{20.23.72.14}{20.23.72.14:27336}{dim}, {node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{xFtBz9RHQ1an5sF3ebLtbg}{20.23.72.15}{20.23.72.15:27336}{dim}] from last-known cluster state; node term 10, last-accepted version 528 in term 10
[2019-10-30T10:24:56,140][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node_20.23.72.15] master not discovered or elected yet, an election requires a node with id [pyKYCHjPQLGcM3EwI_Dc8A], have discovered [{node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{xFtBz9RHQ1an5sF3ebLtbg}{20.23.72.15}{20.23.72.15:27336}{dim}] which is not a quorum; discovery will continue using [20.23.72.14:27336] from hosts providers and [{node_20.23.72.14}{pyKYCHjPQLGcM3EwI_Dc8A}{Tpu2X9YXR4a5Ockkaby0oQ}{20.23.72.14}{20.23.72.14:27336}{dim}, {node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{xFtBz9RHQ1an5sF3ebLtbg}{20.23.72.15}{20.23.72.15:27336}{dim}] from last-known cluster state; node term 10, last-accepted version 528 in term 10
[2019-10-30T10:25:02,898][WARN ][o.e.c.NodeConnectionsService] [node_20.23.72.15] failed to connect to {node_20.23.72.14}{pyKYCHjPQLGcM3EwI_Dc8A}{Tpu2X9YXR4a5Ockkaby0oQ}{20.23.72.14}{20.23.72.14:27336}{dim} (tried [7] times)

A correction to the problem description: after the master node stops, the slave node cannot work standalone. I wrote above that restarting it makes it work; that was wrong, it still does not work after a restart. Logs below:
[2019-10-30T10:35:52,175][INFO ][o.e.e.NodeEnvironment ] [node_20.23.72.15] using [1] data paths, mounts [[/opt (/dev/mapper/vg_root-lv_opt)]], net usable_space [44gb], net total_space [54.8gb], types [ext4]
[2019-10-30T10:35:52,177][INFO ][o.e.e.NodeEnvironment ] [node_20.23.72.15] heap size [989.8mb], compressed ordinary object pointers [true]
[2019-10-30T10:35:52,275][INFO ][o.e.n.Node ] [node_20.23.72.15] node name [node_20.23.72.15], node ID [JaQ87iQpS0qLomhZvl-tPQ], cluster name [log-search-cluster]
[2019-10-30T10:35:52,276][INFO ][o.e.n.Node ] [node_20.23.72.15] version[7.3.1], pid[16794], build[oss/tar/4749ba6/2019-08-19T20:19:25.651794Z], OS[Linux/4.12.14-95.24-default/amd64], JVM[Huawei Technologies Co., Ltd/OpenJDK 64-Bit Server VM/1.8.0_212/25.212-b04]
[2019-10-30T10:35:52,276][INFO ][o.e.n.Node ] [node_20.23.72.15] JVM home [/opt/oss/rtsp/jre-2.2.20]
[2019-10-30T10:35:52,276][INFO ][o.e.n.Node ] [node_20.23.72.15] JVM arguments [-Xms1g, -Xmx1g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.io.tmpdir=/tmp/elasticsearch-7102805610334393263, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=data, -XX:ErrorFile=logs/hs_err_pid%p.log, -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, -XX:+PrintTenuringDistribution, -XX:+PrintGCApplicationStoppedTime, -Xloggc:logs/gc.log, -XX:+UseGCLogFileRotation, -XX:NumberOfGCLogFiles=32, -XX:GCLogFileSize=64m, -Dio.netty.allocator.type=unpooled, -XX:MaxDirectMemorySize=536870912, -Des.path.home=/opt/oss/envs/Product-LogConfigService/20191029194832476/es, -Des.path.conf=/opt/oss/envs/Product-LogConfigService/20191029194832476/es/config, -Des.distribution.flavor=oss, -Des.distribution.type=tar, -Des.bundled_jdk=true]
[2019-10-30T10:35:53,341][INFO ][o.e.p.PluginsService ] [node_20.23.72.15] loaded module [aggs-matrix-stats]
[2019-10-30T10:35:53,342][INFO ][o.e.p.PluginsService ] [node_20.23.72.15] loaded module [analysis-common]
[2019-10-30T10:35:53,342][INFO ][o.e.p.PluginsService ] [node_20.23.72.15] loaded module [ingest-common]
[2019-10-30T10:35:53,342][INFO ][o.e.p.PluginsService ] [node_20.23.72.15] loaded module [ingest-user-agent]
[2019-10-30T10:35:53,343][INFO ][o.e.p.PluginsService ] [node_20.23.72.15] loaded module [lang-expression]
[2019-10-30T10:35:53,343][INFO ][o.e.p.PluginsService ] [node_20.23.72.15] loaded module [lang-mustache]
[2019-10-30T10:35:53,344][INFO ][o.e.p.PluginsService ] [node_20.23.72.15] loaded module [lang-painless]
[2019-10-30T10:35:53,344][INFO ][o.e.p.PluginsService ] [node_20.23.72.15] loaded module [mapper-extras]
[2019-10-30T10:35:53,344][INFO ][o.e.p.PluginsService ] [node_20.23.72.15] loaded module [parent-join]
[2019-10-30T10:35:53,345][INFO ][o.e.p.PluginsService ] [node_20.23.72.15] loaded module [percolator]
[2019-10-30T10:35:53,345][INFO ][o.e.p.PluginsService ] [node_20.23.72.15] loaded module [rank-eval]
[2019-10-30T10:35:53,346][INFO ][o.e.p.PluginsService ] [node_20.23.72.15] loaded module [reindex]
[2019-10-30T10:35:53,346][INFO ][o.e.p.PluginsService ] [node_20.23.72.15] loaded module [repository-url]
[2019-10-30T10:35:53,346][INFO ][o.e.p.PluginsService ] [node_20.23.72.15] loaded module [transport-netty4]
[2019-10-30T10:35:53,347][INFO ][o.e.p.PluginsService ] [node_20.23.72.15] no plugins loaded
[2019-10-30T10:35:57,715][INFO ][o.e.d.DiscoveryModule ] [node_20.23.72.15] using discovery type [zen] and seed hosts providers [settings]
[2019-10-30T10:35:58,447][INFO ][o.e.n.Node ] [node_20.23.72.15] initialized
[2019-10-30T10:35:58,448][INFO ][o.e.n.Node ] [node_20.23.72.15] starting ...
[2019-10-30T10:35:58,635][INFO ][o.e.t.TransportService ] [node_20.23.72.15] publish_address {20.23.72.15:27336}, bound_addresses {20.23.72.15:27336}
[2019-10-30T10:35:58,648][INFO ][o.e.b.BootstrapChecks ] [node_20.23.72.15] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2019-10-30T10:35:58,658][INFO ][o.e.c.c.Coordinator ] [node_20.23.72.15] cluster UUID [O67XhXA9TRG78Qe5yP2Tag]

Continued:
[2019-10-30T10:36:08,674][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node_20.23.72.15] master not discovered or elected yet, an election requires a node with id [pyKYCHjPQLGcM3EwI_Dc8A], have discovered [{node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{V_vcpdirTYeeRRz9xKmjlg}{20.23.72.15}{20.23.72.15:27336}{dim}] which is not a quorum; discovery will continue using from hosts providers and [{node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{V_vcpdirTYeeRRz9xKmjlg}{20.23.72.15}{20.23.72.15:27336}{dim}] from last-known cluster state; node term 10, last-accepted version 528 in term 10
[2019-10-30T10:36:10,781][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [node_20.23.72.15] no known master node, scheduling a retry
[2019-10-30T10:36:10,789][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [node_20.23.72.15] no known master node, scheduling a retry
[2019-10-30T10:36:11,005][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [node_20.23.72.15] no known master node, scheduling a retry
[2019-10-30T10:36:11,006][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [node_20.23.72.15] no known master node, scheduling a retry
[2019-10-30T10:36:11,006][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [node_20.23.72.15] no known master node, scheduling a retry
[2019-10-30T10:36:11,007][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [node_20.23.72.15] no known master node, scheduling a retry
[2019-10-30T10:36:18,676][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node_20.23.72.15] master not discovered or elected yet, an election requires a node with id [pyKYCHjPQLGcM3EwI_Dc8A], have discovered [{node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{V_vcpdirTYeeRRz9xKmjlg}{20.23.72.15}{20.23.72.15:27336}{dim}] which is not a quorum; discovery will continue using from hosts providers and [{node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{V_vcpdirTYeeRRz9xKmjlg}{20.23.72.15}{20.23.72.15:27336}{dim}] from last-known cluster state; node term 10, last-accepted version 528 in term 10
[2019-10-30T10:36:28,678][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node_20.23.72.15] master not discovered or elected yet, an election requires a node with id [pyKYCHjPQLGcM3EwI_Dc8A], have discovered [{node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{V_vcpdirTYeeRRz9xKmjlg}{20.23.72.15}{20.23.72.15:27336}{dim}] which is not a quorum; discovery will continue using from hosts providers and [{node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{V_vcpdirTYeeRRz9xKmjlg}{20.23.72.15}{20.23.72.15:27336}{dim}] from last-known cluster state; node term 10, last-accepted version 528 in term 10
[2019-10-30T10:36:28,688][WARN ][o.e.n.Node ] [node_20.23.72.15] timed out while waiting for initial discovery state - timeout: 30s
[2019-10-30T10:36:28,702][INFO ][o.e.h.AbstractHttpServerTransport] [node_20.23.72.15] publish_address {20.23.72.15:9200}, bound_addresses {20.23.72.15:9200}
[2019-10-30T10:36:28,702][INFO ][o.e.n.Node ] [node_20.23.72.15] started
[2019-10-30T10:36:31,776][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [node_20.23.72.15] no known master node, scheduling a retry
[2019-10-30T10:36:38,680][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node_20.23.72.15] master not discovered or elected yet, an election requires a node with id [pyKYCHjPQLGcM3EwI_Dc8A], have discovered [{node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{V_vcpdirTYeeRRz9xKmjlg}{20.23.72.15}{20.23.72.15:27336}{dim}] which is not a quorum; discovery will continue using from hosts providers and [{node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{V_vcpdirTYeeRRz9xKmjlg}{20.23.72.15}{20.23.72.15:27336}{dim}] from last-known cluster state; node term 10, last-accepted version 528 in term 10
[2019-10-30T10:36:38,889][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [node_20.23.72.15] no known master node, scheduling a retry
[2019-10-30T10:36:38,890][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [node_20.23.72.15] no known master node, scheduling a retry
[2019-10-30T10:36:38,891][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [node_20.23.72.15] no known master node, scheduling a retry
[2019-10-30T10:36:38,891][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [node_20.23.72.15] no known master node, scheduling a retry
[2019-10-30T10:36:48,682][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node_20.23.72.15] master not discovered or elected yet, an election requires a node with id [pyKYCHjPQLGcM3EwI_Dc8A], have discovered [{node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{V_vcpdirTYeeRRz9xKmjlg}{20.23.72.15}{20.23.72.15:27336}{dim}] which is not a quorum; discovery will continue using from hosts providers and [{node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{V_vcpdirTYeeRRz9xKmjlg}{20.23.72.15}{20.23.72.15:27336}{dim}] from last-known cluster state; node term 10, last-accepted version 528 in term 10
[2019-10-30T10:36:58,683][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node_20.23.72.15] master not discovered or elected yet, an election requires a node with id [pyKYCHjPQLGcM3EwI_Dc8A], have discovered [{node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{V_vcpdirTYeeRRz9xKmjlg}{20.23.72.15}{20.23.72.15:27336}{dim}] which is not a quorum; discovery will continue using from hosts providers and [{node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{V_vcpdirTYeeRRz9xKmjlg}{20.23.72.15}{20.23.72.15:27336}{dim}] from last-known cluster state; node term 10, last-accepted version 528 in term 10
[2019-10-30T10:37:01,780][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [node_20.23.72.15] timed out while retrying [cluster:monitor/state] after failure (timeout [30s])
[2019-10-30T10:37:01,797][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [node_20.23.72.15] no known master node, scheduling a retry
[2019-10-30T10:37:08,685][WARN ][o.e.c.c.ClusterFormationFailureHelper] [node_20.23.72.15] master not discovered or elected yet, an election requires a node with id [pyKYCHjPQLGcM3EwI_Dc8A], have discovered [{node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{V_vcpdirTYeeRRz9xKmjlg}{20.23.72.15}{20.23.72.15:27336}{dim}] which is not a quorum; discovery will continue using from hosts providers and [{node_20.23.72.15}{JaQ87iQpS0qLomhZvl-tPQ}{V_vcpdirTYeeRRz9xKmjlg}{20.23.72.15}{20.23.72.15:27336}{dim}] from last-known cluster state; node term 10, last-accepted version 528 in term 10
[2019-10-30T10:37:10,789][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [node_20.23.72.15] timed out while retrying [indices:admin/create] after failure (timeout [1m])
[2019-10-30T10:37:10,790][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [node_20.23.72.15] timed out while retrying [indices:admin/create] after failure (timeout [1m])
[2019-10-30T10:37:11,007][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [node_20.23.72.15] timed out while retrying [indices:admin/create] after failure (timeout [1m])

You cannot have a highly available cluster with 2 nodes. At least 3 master-eligible nodes are required.

Christian_Dahlqvist, thanks for your suggestion!
I know that at least three nodes would be better, but the company only provides two, and since ES6 ran normally on them, it will not provide extra nodes for now, as that would add cost.

If ES6 runs normally when one node goes down, it is incorrectly configured, which is very likely to lead to data loss if you encounter a network partition. See the documentation for further details. If you were to configure it correctly by setting discovery.zen.minimum_master_nodes to 2, it would behave just like ES7.
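For two master-eligible nodes, the correct configuration referred to above is a single line in each node's elasticsearch.yml; it trades away single-node operation in exchange for split-brain safety:

discovery.zen.minimum_master_nodes: 2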

Elasticsearch 7.4 supports a voting-only dedicated master node that requires limited resources. I would recommend you try to add one of these for resiliency, as that would give you 3 master-eligible nodes at little extra cost.
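A sketch of what such a dedicated voting-only master's elasticsearch.yml could contain in 7.4 (note: as far as I know, node.voting_only is part of the default distribution's free features, so it may not be available in the OSS build discussed here):

node.master: true
node.voting_only: true    # may vote in elections but never becomes master itself
node.data: false
node.ingest: false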

Staying on an incorrectly configured 6.X cluster is not recommended.

