Hi!
The [SERVICE_UNAVAILABLE/2/no master] error keeps appearing on the cluster master node. Has anyone else encountered it? It shows up after the cluster has been running for a while, and once it happens the only way to recover is to restart the whole cluster. The setup: three master+data nodes plus one data-only node (SSD), Elasticsearch 7.4.0. The cluster has about 5K indices and 20K shards, and we write roughly 500 GB into ES every day.
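When the error appears, this is roughly how I check which nodes still see a master before restarting (a minimal sketch in Python, assuming the requests library; the password is a placeholder for my setup, since xpack.security is enabled):

import requests

NODES = ["192.168.0.4", "192.168.0.5", "192.168.0.6", "192.168.0.7"]
AUTH = ("elastic", "<password>")  # placeholder credentials

for host in NODES:
    try:
        # /_cat/master asks each node which master it currently believes in;
        # while the no-master block is active this errors out or times out
        r = requests.get(f"http://{host}:9200/_cat/master?format=json",
                         auth=AUTH, timeout=5)
        print(host, r.status_code, r.text.strip())
    except requests.RequestException as exc:
        print(host, "unreachable:", exc)

During a failure every node reports no master, even though all four nodes are reachable.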
elasticsearch.yml:
cluster.name: myelk
node.name: es01
network.host: 192.168.0.4
http.port: 9200
bootstrap.memory_lock: false
http.cors.enabled: true
http.cors.allow-origin: "*"
http.cors.allow-headers: Authorization,X-Requested-With,Content-Length,Content-Type
node.master: true
node.data: true
discovery.zen.ping_timeout: 1200s
xpack.monitoring.collection.cluster.stats.timeout: 180s
xpack.monitoring.collection.node.stats.timeout: 180s
xpack.monitoring.collection.index.recovery.timeout: 180s
discovery.seed_hosts: ["192.168.0.4","192.168.0.5","192.168.0.6","192.168.0.7"]
cluster.initial_master_nodes: ["192.168.0.4"]
cluster.routing.allocation.disk.watermark.low: 100gb
cluster.routing.allocation.disk.watermark.high: 50gb
cluster.routing.allocation.disk.watermark.flood_stage: 30gb
discovery.zen.minimum_master_nodes: 2
bootstrap.system_call_filter: false
cluster.max_shards_per_node: 1000000
indices.queries.cache.count: 20000
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: elastic-certificates.p12
cluster.routing.allocation.node_initial_primaries_recoveries: 64
cluster.routing.allocation.node_concurrent_recoveries: 64
indices.recovery.max_bytes_per_sec: 0
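For context, this is how I sanity-check the shard distribution against the settings above (a rough Python sketch, same placeholder credentials as before; with ~20K shards across four data nodes, each node carries roughly 5K shards, well under cluster.max_shards_per_node but a fairly large cluster state):

import requests

AUTH = ("elastic", "<password>")  # placeholder credentials

# /_cat/allocation lists the shard count and disk usage per data node
r = requests.get("http://192.168.0.4:9200/_cat/allocation?format=json",
                 auth=AUTH, timeout=10)
for row in r.json():
    print(row["node"], "shards:", row["shards"],
          "disk.used:", row.get("disk.used"))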
When the failure occurs, system load and disk I/O are low and stable, and the network has been checked and is normal. Below is the error log:
[2020-04-07T21:40:15,584][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [es01] no known master node, scheduling a retry
[2020-04-07T21:40:25,200][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [es01] timed out while retrying [cluster:monitor/state] after failure (timeout [30s])
[2020-04-07T21:40:27,217][WARN ][r.suppressed ] [es01] path: /_monitoring/bulk, params: {system_id=kibana, system_api_version=6, interval=10000ms}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:189) ~[elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:175) ~[elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.xpack.monitoring.action.TransportMonitoringBulkAction.doExecute(TransportMonitoringBulkAction.java:55) ~[?:?]
at org.elasticsearch.xpack.monitoring.action.TransportMonitoringBulkAction.doExecute(TransportMonitoringBulkAction.java:35) ~[?:?]
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:153) [elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$apply$0(SecurityActionFilter.java:86) [x-pack-security-7.4.0.jar:7.4.0]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) [elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$authorizeRequest$4(SecurityActionFilter.java:172) [x-pack-security-7.4.0.jar:7.4.0]
[2020-04-07T21:40:47,992][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [es01] no known master node, scheduling a retry
[2020-04-07T21:40:57,415][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [es01] timed out while retrying [cluster:monitor/state] after failure (timeout [30s])
[2020-04-07T21:40:57,415][WARN ][r.suppressed ] [es01] path: /_cluster/settings, params: {include_defaults=true}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onTimeout(TransportMasterNodeAction.java:214) [elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:325) [elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:252) [elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:598) [elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.4.0.jar:7.4.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-04-07T21:40:57,438][WARN ][r.suppressed ] [es01] path: /_monitoring/bulk, params: {system_id=kibana, system_api_version=6, interval=10000ms}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:189) ~[elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:175) ~[elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.xpack.monitoring.action.TransportMonitoringBulkAction.doExecute(TransportMonitoringBulkAction.java:55) ~[?:?]
at org.elasticsearch.xpack.monitoring.action.TransportMonitoringBulkAction.doExecute(TransportMonitoringBulkAction.java:35) ~[?:?]
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:153) [elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$apply$0(SecurityActionFilter.java:86) [x-pack-security-7.4.0.jar:7.4.0]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) [elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$authorizeRequest$4(SecurityActionFilter.java:172) [x-pack-security-7.4.0.jar:7.4.0]
[2020-04-10T09:28:02,405][WARN ][o.e.c.s.MasterService ] [es01] took [11.4s], which is over [10s], to compute cluster state update for [cluster_reroute(reroute after starting shards)]
[2020-04-10T09:28:13,845][WARN ][o.e.c.s.MasterService ] [es01] took [10.1s], which is over [10s], to compute cluster state update for [cluster_reroute(reroute after starting shards)]
[2020-04-10T09:31:03,445][WARN ][o.e.m.j.JvmGcMonitorService] [es01] [gc][old][62478][5283] duration [33.5s], collections [1]/[34.1s], total [33.5s]/[6.4m], memory [27.7gb]->[27.3gb]/[31.8gb], all_pools {[young] [9.4mb]->[280.2mb]/[1.4gb]}{[survivor] [187.4mb]->[0b]/[191.3mb]}{[old] [27.5gb]->[27.1gb]/[30.1gb]}
[2020-04-10T09:31:03,445][WARN ][o.e.m.j.JvmGcMonitorService] [es01] [gc][62478] overhead, spent [33.8s] collecting in the last [34.1s]
[2020-04-10T09:31:03,537][INFO ][o.e.c.s.ClusterApplierService] [es01] master node changed {previous [{es01}{89eH7Ca7TCKsiRie1g5IIA}{5YXGPetDSLyFKy7mx52JmQ}{192.168.0.4}{192.168.0.4:9300}{dilm}{ml.machine_memory=67540819968, xpack.installed=true, ml.max_open_jobs=20}], current []}, term: 225, version: 60213, reason: becoming candidate: joinLeaderInTerm
[2020-04-10T09:31:03,940][WARN ][o.e.t.TcpTransport ] [es01] exception caught on transport layer [Netty4TcpChannel{localAddress=/192.168.0.4:9300, remoteAddress=/192.168.0.6:61043}], closing connection
io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:475) ~[netty-codec-4.1.38.Final.jar:4.1.38.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:283) ~[netty-codec-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1421) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:697) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:597) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:551) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918) [netty-common-4.1.38.Final.jar:4.1.38.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.38.Final.jar:4.1.38.Final]
at java.lang.Thread.run(Thread.java:830) [?:?]
Caused by: javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)
at sun.security.ssl.Alert.createSSLException(Alert.java:131) ~[?:?]
at sun.security.ssl.TransportContext.fatal(TransportContext.java:324) ~[?:?]
at sun.security.ssl.TransportContext.fatal(TransportContext.java:267) ~[?:?]
at sun.security.ssl.TransportContext.fatal(TransportContext.java:262) ~[?:?]
at sun.security.ssl.SSLTransport.decode(SSLTransport.java:129) ~[?:?]
at sun.security.ssl.SSLEngineImpl.decode(SSLEngineImpl.java:729) ~[?:?]
at sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:684) ~[?:?]
at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:499) ~[?:?]
at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:475) ~[?:?]
at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:634) ~[?:?]
at io.netty.handler.ssl.SslHandler$SslEngineType$3.unwrap(SslHandler.java:282) ~[netty-handler-4.1.38.Final.jar:4.1.38.Final]
at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1329) ~[netty-handler-4.1.38.Final.jar:4.1.38.Final]
at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1224) ~[netty-handler-4.1.38.Final.jar:4.1.38.Final]
at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1271) ~[netty-handler-4.1.38.Final.jar:4.1.38.Final]
at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:505) ~[netty-codec-4.1.38.Final.jar:4.1.38.Final]
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:444) ~[netty-codec-4.1.38.Final.jar:4.1.38.Final]