The cluster often reports the error [SERVICE_UNAVAILABLE/2/no master]

Hi!
The [SERVICE_UNAVAILABLE/2/no master] error often occurs on the cluster's master node. Have you ever encountered it? It appears after the cluster has been running for a while, and once the error is reported the only way to recover is to restart the whole cluster. The cluster runs Elasticsearch 7.4.0 with three master+data nodes and one data-only node on SSD; it holds 5K indices and 20K shards and ingests about 500 GB per day.
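For reference, a minimal sketch of confirming the index/shard totals and per-node pressure, using only the standard cluster and cat APIs against any node:

GET _cluster/health?filter_path=status,number_of_nodes,active_primary_shards,active_shards
GET _cat/nodes?v&h=name,node.role,master,heap.percent,ram.percent
GET _cat/indices?v&s=pri.store.size:desc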

elasticsearch.yml:

cluster.name: myelk
node.name: es01
network.host: 192.168.0.4
http.port: 9200
bootstrap.memory_lock: false
http.cors.enabled: true
http.cors.allow-origin: "*"
http.cors.allow-headers: Authorization,X-Requested-With,Content-Length,Content-Type
node.master: true
node.data: true
discovery.zen.ping_timeout: 1200s
xpack.monitoring.collection.cluster.stats.timeout: 180s
xpack.monitoring.collection.node.stats.timeout: 180s
xpack.monitoring.collection.index.recovery.timeout: 180s
discovery.seed_hosts: ["192.168.0.4","192.168.0.5","192.168.0.6","192.168.0.7"]
cluster.initial_master_nodes: ["192.168.0.4"]
cluster.routing.allocation.disk.watermark.low: 100gb
cluster.routing.allocation.disk.watermark.high: 50gb
cluster.routing.allocation.disk.watermark.flood_stage: 30gb
discovery.zen.minimum_master_nodes: 2
bootstrap.system_call_filter: false
cluster.max_shards_per_node: 1000000
indices.queries.cache.count: 20000
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: elastic-certificates.p12
cluster.routing.allocation.node_initial_primaries_recoveries: 64
cluster.routing.allocation.node_concurrent_recoveries: 64
indices.recovery.max_bytes_per_sec: 0

When the failure occurs, system load and disk I/O are low and stable, and the network has been checked with no problems found. The following is the error log:

[2020-04-07T21:40:15,584][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [es01] no known master node, scheduling a retry
[2020-04-07T21:40:25,200][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [es01] timed out while retrying [cluster:monitor/state] after failure (timeout [30s])
[2020-04-07T21:40:27,217][WARN ][r.suppressed ] [es01] path: /_monitoring/bulk, params: {system_id=kibana, system_api_version=6, interval=10000ms}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:189) ~[elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:175) ~[elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.xpack.monitoring.action.TransportMonitoringBulkAction.doExecute(TransportMonitoringBulkAction.java:55) ~[?:?]
at org.elasticsearch.xpack.monitoring.action.TransportMonitoringBulkAction.doExecute(TransportMonitoringBulkAction.java:35) ~[?:?]
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:153) [elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$apply$0(SecurityActionFilter.java:86) [x-pack-security-7.4.0.jar:7.4.0]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) [elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$authorizeRequest$4(SecurityActionFilter.java:172) [x-pack-security-7.4.0.jar:7.4.0]

[2020-04-07T21:40:47,992][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [es01] no known master node, scheduling a retry
[2020-04-07T21:40:57,415][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [es01] timed out while retrying [cluster:monitor/state] after failure (timeout [30s])
[2020-04-07T21:40:57,415][WARN ][r.suppressed ] [es01] path: /_cluster/settings, params: {include_defaults=true}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onTimeout(TransportMasterNodeAction.java:214) [elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:325) [elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:252) [elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:598) [elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.4.0.jar:7.4.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-04-07T21:40:57,438][WARN ][r.suppressed ] [es01] path: /_monitoring/bulk, params: {system_id=kibana, system_api_version=6, interval=10000ms}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:189) ~[elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:175) ~[elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.xpack.monitoring.action.TransportMonitoringBulkAction.doExecute(TransportMonitoringBulkAction.java:55) ~[?:?]
at org.elasticsearch.xpack.monitoring.action.TransportMonitoringBulkAction.doExecute(TransportMonitoringBulkAction.java:35) ~[?:?]
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:153) [elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$apply$0(SecurityActionFilter.java:86) [x-pack-security-7.4.0.jar:7.4.0]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) [elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$authorizeRequest$4(SecurityActionFilter.java:172) [x-pack-security-7.4.0.jar:7.4.0]

[2020-04-10T09:28:02,405][WARN ][o.e.c.s.MasterService ] [es01] took [11.4s], which is over [10s], to compute cluster state update for [cluster_reroute(reroute after starting shards)]
[2020-04-10T09:28:13,845][WARN ][o.e.c.s.MasterService ] [es01] took [10.1s], which is over [10s], to compute cluster state update for [cluster_reroute(reroute after starting shards)]
[2020-04-10T09:31:03,445][WARN ][o.e.m.j.JvmGcMonitorService] [es01] [gc][old][62478][5283] duration [33.5s], collections [1]/[34.1s], total [33.5s]/[6.4m], memory [27.7gb]->[27.3gb]/[31.8gb], all_pools {[young] [9.4mb]->[280.2mb]/[1.4gb]}{[survivor] [187.4mb]->[0b]/[191.3mb]}{[old] [27.5gb]->[27.1gb]/[30.1gb]}
[2020-04-10T09:31:03,445][WARN ][o.e.m.j.JvmGcMonitorService] [es01] [gc][62478] overhead, spent [33.8s] collecting in the last [34.1s]
[2020-04-10T09:31:03,537][INFO ][o.e.c.s.ClusterApplierService] [es01] master node changed {previous [{es01}{89eH7Ca7TCKsiRie1g5IIA}{5YXGPetDSLyFKy7mx52JmQ}{192.168.0.4}{192.168.0.4:9300}{dilm}{ml.machine_memory=67540819968, xpack.installed=true, ml.max_open_jobs=20}], current }, term: 225, version: 60213, reason: becoming candidate: joinLeaderInTerm
[2020-04-10T09:31:03,940][WARN ][o.e.t.TcpTransport ] [es01] exception caught on transport layer [Netty4TcpChannel{localAddress=/192.168.0.4:9300, remoteAddress=/192.168.0.6:61043}], closing connection
io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:475) ~[netty-codec-4.1.38.Final.jar:4.1.38.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:283) ~[netty-codec-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1421) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:697) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:597) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:551) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918) [netty-common-4.1.38.Final.jar:4.1.38.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.38.Final.jar:4.1.38.Final]
at java.lang.Thread.run(Thread.java:830) [?:?]
Caused by: javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)
at sun.security.ssl.Alert.createSSLException(Alert.java:131) ~[?:?]
at sun.security.ssl.TransportContext.fatal(TransportContext.java:324) ~[?:?]
at sun.security.ssl.TransportContext.fatal(TransportContext.java:267) ~[?:?]
at sun.security.ssl.TransportContext.fatal(TransportContext.java:262) ~[?:?]
at sun.security.ssl.SSLTransport.decode(SSLTransport.java:129) ~[?:?]
at sun.security.ssl.SSLEngineImpl.decode(SSLEngineImpl.java:729) ~[?:?]
at sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:684) ~[?:?]
at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:499) ~[?:?]
at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:475) ~[?:?]
at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:634) ~[?:?]
at io.netty.handler.ssl.SslHandler$SslEngineType$3.unwrap(SslHandler.java:282) ~[netty-handler-4.1.38.Final.jar:4.1.38.Final]
at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1329) ~[netty-handler-4.1.38.Final.jar:4.1.38.Final]
at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1224) ~[netty-handler-4.1.38.Final.jar:4.1.38.Final]
at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1271) ~[netty-handler-4.1.38.Final.jar:4.1.38.Final]
at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:505) ~[netty-codec-4.1.38.Final.jar:4.1.38.Final]
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:444) ~[netty-codec-4.1.38.Final.jar:4.1.38.Final]

It looks like your cluster is having issues with GC and cluster state updates taking a long time, which is affecting stability.
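If it helps, a minimal sketch for confirming the heap and GC pressure those log lines point at, using only the standard node stats APIs:

GET _cat/nodes?v&h=name,heap.percent,heap.max,ram.percent
GET _nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent,nodes.*.jvm.gc.collectors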

This is likely due to the excessive settings in your configuration. You already have far too many shards for a cluster of that size, and I suspect this is a big part of your problems. Please read this blog post, then look to dramatically reduce the number of shards in the cluster and revert the settings quoted below to their default values.

Why have you introduced these custom settings?

Are you by any chance using swap??

cluster.max_shards_per_node: 1000000
indices.queries.cache.count: 20000
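As a hedged sketch of reverting these: cluster.max_shards_per_node is a dynamic cluster setting, so if it was ever applied through the settings API as well, clearing it with null restores the default; indices.queries.cache.count is a static node setting, so it has to be removed from elasticsearch.yml and the nodes restarted.

PUT _cluster/settings
{
  "persistent": { "cluster.max_shards_per_node": null },
  "transient": { "cluster.max_shards_per_node": null }
}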

OK, I'll try to reduce the number of shards.
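One possible way to do that is the shrink API; a minimal sketch (the index names here are hypothetical, and the source index must first be made read-only and fully relocated to a single node, es01 in this example, before shrinking):

PUT logs-000001/_settings
{
  "settings": {
    "index.routing.allocation.require._name": "es01",
    "index.blocks.write": true
  }
}

POST logs-000001/_shrink/logs-000001-shrunk
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1,
    "index.routing.allocation.require._name": null,
    "index.blocks.write": null
  }
}

Going forward, fewer primary shards per index in the index templates (plus rollover) keeps the count from climbing back up.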

discovery.zen.ping_timeout: 1200s

I set this because the cluster's master node is often lost; I increased the timeout to see whether it would help.

bootstrap.memory_lock: false

Swap is disabled.
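To double-check from the Elasticsearch side, a minimal sketch (with bootstrap.memory_lock: false in the config above, mlockall will report false, so this mainly confirms the OS-level swap figures are zero):

GET _nodes?filter_path=**.mlockall
GET _nodes/stats/os?filter_path=nodes.*.name,nodes.*.os.swap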
