Failed to connect to busy master

   org.elasticsearch.transport.ConnectTransportException: [IF4wObj][10.xxx.xxx.xx:9300] connect_exception
	at org.elasticsearch.transport.TcpChannel.awaitConnected(TcpChannel.java:165) ~[elasticsearch-6.3.1.jar:6.3.1]
	at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:631) ~[elasticsearch-6.3.1.jar:6.3.1]
	at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:530) ~[elasticsearch-6.3.1.jar:6.3.1]
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:331) ~[elasticsearch-6.3.1.jar:6.3.1]
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:318) ~[elasticsearch-6.3.1.jar:6.3.1]
	at org.elasticsearch.discovery.zen.ZenDiscovery.joinElectedMaster(ZenDiscovery.java:509) [elasticsearch-6.3.1.jar:6.3.1]
	at org.elasticsearch.discovery.zen.ZenDiscovery.innerJoinCluster(ZenDiscovery.java:477) [elasticsearch-6.3.1.jar:6.3.1]
	at org.elasticsearch.discovery.zen.ZenDiscovery.access$2500(ZenDiscovery.java:90) [elasticsearch-6.3.1.jar:6.3.1]
	at org.elasticsearch.discovery.zen.ZenDiscovery$JoinThreadControl$1.run(ZenDiscovery.java:1247) [elasticsearch-6.3.1.jar:6.3.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:626) [elasticsearch-6.3.1.jar:6.3.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
     Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: 10.xxx.xxx.xx/10.xxx.xx.xx:9300

Master node logs (GC is taking time):
        [2018-09-27T03:46:02,600][WARN ][o.e.d.z.PublishClusterStateAction] [IF4wObj] publishing cluster state with version [16947] failed for the following nodes: [[{jUNEiLv}{jUNEiLvvSs-ZNtXW_2ZJSA}{UMlogXD5QjOmAU5OPckHow}{xx.xx.xx.xx}{xx.xxx.xxx.xx:9300}]]
        [2018-09-27T03:46:02,607][INFO ][o.e.m.j.JvmGcMonitorService] [IF4wObj] [gc][232593] overhead, spent [570ms] collecting in the last [1.1s]
        [2018-09-27T03:46:04,608][WARN ][o.e.m.j.JvmGcMonitorService] [IF4wObj] [gc][232595] overhead, spent [585ms] collecting in the last [1s]
        [2018-09-27T03:46:06,608][WARN ][o.e.m.j.JvmGcMonitorService] [IF4wObj] [gc][232597] overhead, spent [561ms] collecting in the last [1s]
        [2018-09-27T03:46:08,609][WARN ][o.e.m.j.JvmGcMonitorService] [IF4wObj] [gc][232599] overhead, spent [638ms] collecting in the last [1s]
        [2018-09-27T03:46:10,609][WARN ][o.e.m.j.JvmGcMonitorService] [IF4wObj] [gc][232601] overhead, spent [516ms] collecting in the last [1s]
        [2018-09-27T03:46:10,933][WARN ][o.e.d.z.PublishClusterStateAction] [IF4wObj] publishing cluster state with version [16948] failed for the following nodes: [[{jUNEiLv}{jUNEiLvvSs-ZNtXW_2ZJSA}{UMlogXD5QjOmAU5OPckHow}{xx.xx.xx.xx}{xx.xx.xx.xx:9300}]]

The hot threads are:
java.util.zip.Deflater.deflateBytes(Native Method)
   java.util.zip.Deflater.deflate(Deflater.java:444)
   java.util.zip.Deflater.deflate(Deflater.java:366)
   java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:251)
   java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
   java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
   java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
   org.elasticsearch.common.io.stream.OutputStreamStreamOutput.writeBytes(OutputStreamStreamOutput.java:40)
   org.elasticsearch.common.io.stream.StreamOutput.writeBytes(StreamOutput.java:160)
   org.elasticsearch.common.compress.CompressedXContent.writeTo(CompressedXContent.java:163)
   org.elasticsearch.cluster.metadata.MappingMetaData.writeTo(MappingMetaData.java:182)
   org.elasticsearch.cluster.metadata.IndexMetaData.writeTo(IndexMetaData.java:758)
   org.elasticsearch.cluster.metadata.MetaData.writeTo(MetaData.java:793)
   org.elasticsearch.cluster.ClusterState.writeTo(ClusterState.java:741)
   org.elasticsearch.discovery.zen.PublishClusterStateAction.serializeFullClusterState(PublishClusterStateAction.java:355)
   org.elasticsearch.action.admin.cluster.state.TransportClusterStateAction.masterOperation(TransportClusterStateAction.java:136)
   org.elasticsearch.action.admin.cluster.state.TransportClusterStateAction.masterOperation(TransportClusterStateAction.java:44)

One node is attempting to join the cluster. It is able to ping other nodes in the cluster to figure out which node is the master. However, since there are 30k shards in the cluster and the master has 16 processor cores, the master is experiencing above 80% CPU usage. I tried increasing the TCP connection timeout, but even that doesn't help. Is there a dedicated thread pool for discovery?
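
(For reference, the timeout increase mentioned above would presumably be the transport connect timeout in elasticsearch.yml; the value below is only an illustration:)

    # illustration only - transport connect timeout, which defaults to 30s in 6.x
    transport.tcp.connect_timeout: 120s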

How many nodes do you have in the cluster? What is the specification of these nodes? Which version of Elasticsearch are you running?

200 data nodes: i3.2xlarge
3 dedicated master nodes: c4.2xlarge
Elasticsearch version: 6.3

That sounds like a very large instance type for a dedicated master node, given that it should just sit and manage the cluster and not handle requests. What type of load is it under that causes so much CPU usage? Can you show the output of the hot threads API?
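
(Output like the one below comes from the nodes hot threads API, for example:)

    # hot threads on the elected master; 'threads' controls how many threads are reported
    curl -s 'http://localhost:9200/_nodes/_master/hot_threads?threads=5'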

101.1% (505.2ms out of 500ms) cpu usage by thread 'elasticsearch[IF4wObj][[transport_server_worker.default]][T#30]'
 5/10 snapshots sharing following 62 elements
   java.util.zip.Deflater.deflateBytes(Native Method)
   java.util.zip.Deflater.deflate(Deflater.java:444)
   java.util.zip.Deflater.deflate(Deflater.java:366)
   java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:251)
   java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
   java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
   java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
   org.elasticsearch.common.io.stream.OutputStreamStreamOutput.writeBytes(OutputStreamStreamOutput.java:40)
   org.elasticsearch.common.io.stream.StreamOutput.writeBytes(StreamOutput.java:170)
   org.elasticsearch.common.io.stream.StreamOutput.writeString(StreamOutput.java:397)
   org.elasticsearch.common.io.stream.StreamOutput.writeStringArray(StreamOutput.java:463)
   org.elasticsearch.cluster.DiffableUtils$StringSetValueSerializer.write(DiffableUtils.java:702)
   org.elasticsearch.cluster.metadata.IndexMetaData.writeTo(IndexMetaData.java:772)
   org.elasticsearch.cluster.metadata.MetaData.writeTo(MetaData.java:793)
   org.elasticsearch.cluster.ClusterState.writeTo(ClusterState.java:741)
   org.elasticsearch.discovery.zen.PublishClusterStateAction.serializeFullClusterState(PublishClusterStateAction.java:355)
   org.elasticsearch.action.admin.cluster.state.TransportClusterStateAction.masterOperation(TransportClusterStateAction.java:136)
   org.elasticsearch.action.admin.cluster.state.TransportClusterStateAction.masterOperation(TransportClusterStateAction.java:44)
   org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:88)
   org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.doRun(TransportMasterNodeAction.java:174)
   org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   org.elasticsearch.common.util.concurrent.EsExecutors$1.execute(EsExecutors.java:135)
   org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.doStart(TransportMasterNodeAction.java:171)
   org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.start(TransportMasterNodeAction.java:127)
   org.elasticsearch.action.support.master.TransportMasterNodeAction.doExecute(TransportMasterNodeAction.java:105)
   org.elasticsearch.action.support.master.TransportMasterNodeAction.doExecute(TransportMasterNodeAction.java:55)
   org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:167)
   org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:139)
   org.elasticsearch.action.support.HandledTransportAction$TransportHandler.messageReceived(HandledTransportAction.java:79)
   org.elasticsearch.action.support.HandledTransportAction$TransportHandler.messageReceived(HandledTransportAction.java:69)
   org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66)
   org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1592)
   org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   org.elasticsearch.common.util.concurrent.EsExecutors$1.execute(EsExecutors.java:135)
   org.elasticsearch.transport.TcpTransport.handleRequest(TcpTransport.java:1550)
   org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1414)
   org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:64)
   io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
   io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
   io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
   io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
   io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)
   io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)
   io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
   io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
   io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
   io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
   io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
   io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
   io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
   io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
   io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
   io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
   io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
   io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
   io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
   io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
   io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:545)
   io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:499)
   io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
   io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
   java.lang.Thread.run(Thread.java:748)
4/10 snapshots sharing following 61 elements
   java.util.zip.Deflater.deflateBytes(Native Method)
   java.util.zip.Deflater.deflate(Deflater.java:444)
   java.util.zip.Deflater.deflate(Deflater.java:366)
   java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:251)
   java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
   java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
   java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
   org.elasticsearch.common.io.stream.OutputStreamStreamOutput.writeBytes(OutputStreamStreamOutput.java:40)
   org.elasticsearch.common.io.stream.StreamOutput.writeBytes(StreamOutput.java:160)
   org.elasticsearch.common.compress.CompressedXContent.writeTo(CompressedXContent.java:163)
   org.elasticsearch.cluster.metadata.MappingMetaData.writeTo(MappingMetaData.java:182)
   org.elasticsearch.cluster.metadata.IndexMetaData.writeTo(IndexMetaData.java:758)
   org.elasticsearch.cluster.metadata.MetaData.writeTo(MetaData.java:793)
   org.elasticsearch.cluster.ClusterState.writeTo(ClusterState.java:741)
   org.elasticsearch.discovery.zen.PublishClusterStateAction.serializeFullClusterState(PublishClusterStateAction.java:355)
   org.elasticsearch.action.admin.cluster.state.TransportClusterStateAction.masterOperation(TransportClusterStateAction.java:136)
   org.elasticsearch.action.admin.cluster.state.TransportClusterStateAction.masterOperation(TransportClusterStateAction.java:44)
   org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:88)
   org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.doRun(TransportMasterNodeAction.java:174)
   org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   org.elasticsearch.common.util.concurrent.EsExecutors$1.execute(EsExecutors.java:135)
   org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.doStart(TransportMasterNodeAction.java:171)
   org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.start(TransportMasterNodeAction.java:127)
   org.elasticsearch.action.support.master.TransportMasterNodeAction.doExecute(TransportMasterNodeAction.java:105)
   org.elasticsearch.action.support.master.TransportMasterNodeAction.doExecute(TransportMasterNodeAction.java:55)
   org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:167)
   org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:139)
   org.elasticsearch.action.support.HandledTransportAction$TransportHandler.messageReceived(HandledTransportAction.java:79)
   org.elasticsearch.action.support.HandledTransportAction$TransportHandler.messageReceived(HandledTransportAction.java:69)
   org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66)
   org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1592)
   org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   org.elasticsearch.common.util.concurrent.EsExecutors$1.execute(EsExecutors.java:135)
   org.elasticsearch.transport.TcpTransport.handleRequest(TcpTransport.java:1550)
   org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1414)
   org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:64)
   io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
   io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
   io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
   io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
   io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)
   io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)
   io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
   io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
   io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
   io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
   io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
   io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
   io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
   io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
   io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
   io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
   io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
   io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
   io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
   io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
   io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:545)
   io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:499)
   io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
   io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
   java.lang.Thread.run(Thread.java:748)

It looks like you are using them as coordinating nodes in addition to their master role, is that correct?

If that is the case I would recommend keeping these as coordinating-only nodes and adding 3 small dedicated master nodes to the cluster. These should not receive any requests and should be left alone to manage the cluster. They do not need much CPU, so r5.large or r5.xlarge instances might be suitable.

node.data: false     # do not hold data or serve search/index requests
node.ingest: false   # do not run ingest pipelines
node.master: true    # remain eligible to be elected master
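
(Once in place, the roles can be double-checked with the cat nodes API; the request below is just an example:)

    # 'm' in node.role marks master-eligible nodes, '*' in the master column marks the elected master
    curl -s 'http://localhost:9200/_cat/nodes?v&h=name,node.role,master'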

The cluster isn't serving any requests (search/index). The health of the cluster is flaky, and some nodes are unable to join the cluster, as shown in the first stack trace.
In my case I saw all cores being utilized, with average CPU around 85%. The JVM GC monitoring also showed signs of old-generation GC through the WARN logging.

It seems like there are issues publishing the cluster state. Do you see evidence of high heap pressure or long/frequent GC on the data nodes? What does the heap usage pattern of the data nodes look like?

What does the heap usage pattern on the master node look like?

On the data nodes, since there are no search/aggregation queries of any sort, I don't see high JVM memory pressure. The master nodes do, however; I have noticed JVM heap usage spiking on the master. Is there a way I could tell how much CPU is going into GC, as CPU tends to stay almost flat above 85%?

Also, does a bigger cluster state (owing to the large number of shards) mean that the master would struggle with publishing the cluster state, thereby making the cluster flaky?

How much heap do the dedicated master nodes have? What is the lowest level it goes down to following GC? What is the state of the node it reports having problems publishing the cluster state to?

32 GB of heap is what was being used when I had bumped the master to c4.8xlarge. When I did a heap dump, these were the results. I am assuming the GatewayAllocator maintains shard-level state of which nodes hold an up-to-date copy. The master is unable to publish the cluster state to the nodes that are unhealthy, i.e. struggling with connection timeouts when connecting to the master.

I guess serializing the full cluster state is consuming the CPU. I'm wondering whether the zen-disco-node-join event could be bumped up from URGENT to IMMEDIATE to allow nodes to join the master. Is there a specific reason why it is set to URGENT? Since shard-started is also URGENT, a cluster recovery can cause nodes to be disconnected for a longer time. @Christian_Dahlqvist @ywelsch

It looks to me as if there are different unrelated issues being discussed here. The stack trace in the opening post shows that a node can figure out who the master is, and can connect to the master, but the master fails to connect back to the node. This generally indicates network configuration issues.

An unrelated issue seems to be how long it takes to serialize the cluster state. Based on the timings here, I suspect that this cluster state has humongous mappings (probably field explosion), which makes cluster state publishing dog slow when the full cluster state is to be published (which typically happens on node join). Can you check the cluster state size (GET _cluster/state, which includes information about the total compressed size of the cluster state)?
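
(For example, a request along these lines should return just the size information without dumping the whole state; exact field names may differ by version:)

    curl -s 'http://localhost:9200/_cluster/state?filter_path=compressed_size*&human'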

I'm wondering whether the zen-disco-node-join event could be bumped up from URGENT to IMMEDIATE to allow nodes to join the master

I'm not sure how that would help here. Both node joins and shard started events are treated with the same priority, i.e. handled fairly. As shard started tasks are batched together, this would at worst mean that the node join has to wait for the publication of the previous batch of shard started events, but would then get to execute right away.

Thanks @ywelsch @Christian_Dahlqvist
I am aware that the cluster state is huge, 122 MB (90k shards) in size. The major concern is that the stability of the cluster is getting compromised because of this. The master becomes busy handling so many shards and is unable to respond to node pings, after which the node is removed from the cluster. The node then tries to rejoin, and the master has to serialize the entire cluster state and send it to the node. Let me know if my understanding is incorrect. What are the general guidelines for handling a large number of shards / huge index mappings? Also, the hot threads stack trace shows the full cluster state being serialized. Is that part of a node join?

org.elasticsearch.discovery.zen.PublishClusterStateAction.serializeFullClusterState(PublishClusterStateAction.java:355)
   org.elasticsearch.action.admin.cluster.state.TransportClusterStateAction.masterOperation(TransportClusterStateAction.java:136)
   org.elasticsearch.action.admin.cluster.state.TransportClusterStateAction.masterOperation(TransportClusterStateAction.java:44)
   org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:88)
   org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.doRun(TransportMasterNodeAction.java:174)
   org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   org.elasticsearch.common.util.concurrent.EsExecutors$1.execute(EsExecutors.java:135)
   org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.doStart(TransportMasterNodeAction.java:171)
   org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.start(TransportMasterNodeAction.java:127)
   org.elasticsearch.action.support.master.TransportMasterNodeAction.doExecute(TransportMasterNodeAction.java:105)
   org.elasticsearch.action.support.master.TransportMasterNodeAction.doExecute(TransportMasterNodeAction.java:55)
   org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:167)
   org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:139)
   org.elasticsearch.action.support.HandledTransportAction$TransportHandler.messageReceived(HandledTransportAction.java:79)
   org.elasticsearch.action.support.HandledTransportAction$TransportHandler.messageReceived(HandledTransportAction.java:69)
   org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66)

A few questions that I have been unable to answer myself:

  1. Why do I see so many hot threads serializing the full cluster state, given that cluster state updates for zen-disco-node-join events are processed sequentially?
  2. I am assuming unicast zen discovery pings are processed by the master on a separate thread. My concern is why the discovery pings fail to get processed. Is it purely related to network congestion? I have seen "Possible SYN flooding on port 9300" through dmesg.

There are a couple of ways I can see to address this. Based on the storage available to the instance types you mentioned, it would seem that your average shard size is below 5GB, which is not very large. One option could therefore be to reduce the shard count by increasing the average shard size.
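
(For instance, current shard sizes could be inspected with the cat shards API; the parameters here are only an example:)

    # list shards with their store size in GB, largest first
    curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,store&bytes=gb&s=store:desc'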

Although there is no defined limit to how large an Elasticsearch cluster can be, there is often a practical limit that is very use-case dependent. If you were to split your cluster up into a number of smaller ones, each would hold only a portion of the current cluster state and you could still query across them using cross-cluster search.
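
(A cross-cluster search would then address indices through a remote cluster alias, roughly as follows; the alias and index name are placeholders:)

    # 'cluster_two' is a hypothetical remote cluster alias configured on the local cluster
    curl -s 'http://localhost:9200/cluster_two:logs-2018.10.01/_search?size=1'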

Is that really the case? Pings are handled on a separate thread, so it's not clear to me how slow cluster state publishing would affect those.

The number of shards in this cluster and the size of the cluster state are beyond reasonable. As such, the best way forward is to drastically reduce the number of shards and to think about splitting this up into multiple clusters.

The stack trace you're showing here comes from a _cluster/state request (TransportClusterStateAction), i.e., a request initiated by a user (or by a different system), and has nothing to do with cluster state publication.

You might be confusing cluster states being serialized for publication with those serialized because of user requests to get the latest cluster state (see the two callers of PublishClusterStateAction#serializeFullClusterState).

Do you have any concrete logs showing that?

This could be caused by the high number of nodes initially connecting to the master node (each node opens 13 connections against the master). You can try to increase the kernel socket backlog limit to help with that.
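
(On Linux this is typically done with sysctl; the values below are only illustrative:)

    # illustrative values - raise the accept queue and SYN backlog limits
    sudo sysctl -w net.core.somaxconn=4096
    sudo sysctl -w net.ipv4.tcp_max_syn_backlog=4096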

The conclusion here should be to follow Christian's recommendations and size down the cluster.

[2018-10-01T07:23:16,814][WARN ][o.e.d.z.UnicastZenPing   ] [XilauFq] failed to send ping to [{CKVhOlp}{CKVhOlpgQvurbCeS3mYoQA}{DFep47SyTQGyC5c5fE-GwA}{ x.xxx.xx.xxx}{ x.xxx.xx.xxx:9300}{  zone=us-east-1c}]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [CKVhOlp][ x.xxx.xx.xxx:9300][internal:discovery/zen/unicast] request_id [676307] timed out after [6250ms]
  at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:987) [elasticsearch-6.3.1.jar:6.3.1]
  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:626) [elasticsearch-6.3.1.jar:6.3.1]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
[2018-10-01T07:23:18,480][WARN ][o.e.d.z.UnicastZenPing   ] [XilauFq] failed to send ping to [{CKVhOlp}{CKVhOlpgQvurbCeS3mYoQA}{DFep47SyTQGyC5c5fE-GwA}{ x.xxx.xx.xxx}{ x.xxx.xx.xxx:9300}{  zone=us-east-1c}]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [CKVhOlp][ x.xxx.xx.xxx:9300][internal:discovery/zen/unicast] request_id [676757] timed out after [6250ms]
  at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:987) [elasticsearch-6.3.1.jar:6.3.1]
  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:626) [elasticsearch-6.3.1.jar:6.3.1]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
[2018-10-01T07:23:18,632][DEBUG][o.e.d.z.PublishClusterStateAction] [XilauFq] received diff cluster state version [924] with uuid [ORIEcBlSTqO0t8tdh2th2w], diff size [310]
[2018-10-01T07:23:20,146][WARN ][o.e.d.z.UnicastZenPing   ] [XilauFq] failed to send ping to [{CKVhOlp}{CKVhOlpgQvurbCeS3mYoQA}{DFep47SyTQGyC5c5fE-GwA}{ x.xxx.xx.xxx}{ x.xxx.xx.xxx:9300}{  zone=us-east-1c}]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [CKVhOlp][ x.xxx.xx.xxx:9300][internal:discovery/zen/unicast] request_id [677153] timed out after [6251ms]
  at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:987) [elasticsearch-6.3.1.jar:6.3.1]
  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:626) [elasticsearch-6.3.1.jar:6.3.1]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]


  [XilauFq] Received response for a request that has timed out, sent [924753ms] ago, timed out [918503ms] ago, action [internal:discovery/zen/unicast], node [{CKVhOlp}{CKVhOlpgQvurbCeS3mYoQA}{DFep47SyTQGyC5c5fE-GwA}{ x.xxx.xx.xxx}{ x.xxx.xx.xxx:9300}{  zone=us-east-1c}], id [577014]
[2018-10-01T07:24:27,649][WARN ][o.e.t.TransportService   ] [XilauFq] Received response for a request that has timed out, sent [705165ms] ago, timed out [698915ms] ago, action [internal:discovery/zen/unicast], node [{CKVhOlp}{CKVhOlpgQvurbCeS3mYoQA}{DFep47SyTQGyC5c5fE-GwA}{ x.xxx.xx.xxx}{ x.xxx.xx.xxx:9300}{  zone=us-east-1c}], id [598709]
[2018-10-01T07:24:27,650][WARN ][o.e.t.TransportService   ] [XilauFq] Received response for a request that has timed out, sent [693826ms] ago, timed out [687575ms] ago, action [internal:discovery/zen/unicast], node [{CKVhOlp}{CKVhOlpgQvurbCeS3mYoQA}{DFep47SyTQGyC5c5fE-GwA}{ x.xxx.xx.xxx}{ x.xxx.xx.xxx:9300}{  zone=us-east-1c}], id [600367]
[2018-10-01T07:24:27,651][WARN ][o.e.t.TransportService   ] [XilauFq] Received response for a request that has timed out, sent [676514ms] ago, timed out [670264ms] ago, action [internal:discovery/zen/unicast], node [{CKVhOlp}{CKVhOlpgQvurbCeS3mYoQA}{DFep47SyTQGyC5c5fE-GwA}{ x.xxx.xx.xxx}{ x.xxx.xx.xxx:9300}{  zone=us-east-1c}], id [603201]

Thanks @ywelsch,
I'll check the logs for the cluster state size. So, irrespective of the cluster state, the discovery module should function normally with a reasonable number of nodes?
The above logs, taken from a data node, show an instance of responses getting delayed.
