Change the default master connection timeout of 30s to 60s

Hi, I am using Elasticsearch 5.5.2, and when node 2 tries to connect to the master node (node 1) I get the error below. I had opened this thread, "Getting ConnectTimeoutException When joining in cluster Even if nodes are reachable", and tried cleaning up the data folder on node 2, but I still see the same issue. I am wondering whether the node needs more time to connect, so I want to increase the default timeout from 30s to 60s. Is there a parameter I can change in elasticsearch.yml to increase the default timeout? Please let me know.
[WARN ][o.e.d.z.ZenDiscovery ] [rqnr5CF] failed to connect to master [{AhMmXxh}{AhMmXxhBRTGvK0DyD-CMuQ}{b2m73mOjQyi9xug0abOK9w}{node1 hostname}{node1 ip:9300}], retrying...
org.elasticsearch.transport.ConnectTransportException: [AhMmXxh][node1 ip:9300] connect_timeout[30s]

It's very important that you upgrade.
8.0 has been released. I'd recommend going either to 7.17 or 8.0.

Hi David, thanks for replying. In the latest version of our application we have upgraded to 6.8, but our customer is using an older version of our application that still runs on 5.5.2. Can you please let me know whether we can increase the default timeout of 30s via a parameter in the YAML file?

You can maybe increase that, but you should fix the real problem instead...

What is the output of:

GET /
GET /_cat/nodes?v
GET /_cat/health?v
GET /_cat/indices?v

If some outputs are too big, please share them on gist.github.com and link them here.

Hi David, the issue here is that node 2 is not able to detect its master, node 1. Below is the health status of node 1 (master) and node 2.
node 1 # curl -k -XGET 'https://localhost:9200/_cluster/health?pretty'
{
"cluster_name" : "proj-Elasticsearch",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 275,
"active_shards" : 541,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 9,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 98.36363636363636
}

node 2 # curl -k -XGET 'https://localhost:9200/_cluster/health?pretty'
{
"cluster_name" : "proj-Elasticsearch",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 280,
"active_shards" : 551,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 9,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 98.39285714285714
}
I am seeing this error on node 2:
[o.e.n.Node ] [dTijf8b] timed out while waiting for initial discovery state - timeout: 30s
[2022-02-10T18:04:22,614][WARN ][o.e.d.z.ZenDiscovery ] [dTijf8b] failed to connect to master [{46VbSuh}{46VbSuhJSb2dIoxQWNi77Q}{1d7fXPVfSoWPrBw4mySJ-Q}{iseD-pan-e1.wal-mart.com}{10.24.135.92:9300}], retrying...
org.elasticsearch.transport.ConnectTransportException: [46VbSuh][10.24.135.92:9300] connect_timeout[30s]

Can you please let me know which parameter to add to increase the timeout?

Have a look at Zen Discovery | Elasticsearch Reference [5.5] | Elastic

But you did not share everything I asked for.

Thanks David. From the link you shared, I understand that discovery.zen.fd.ping_timeout: 60s will help increase the timeout. Can you please let me know whether the value should be written as 60s or in some other format?
I am trying to get access to the setup and will shortly share the data you asked for.
Thank you!!
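
A note on which timeout is which, as a minimal sketch for elasticsearch.yml on 5.x. It assumes that the `connect_timeout[30s]` in the exception comes from the transport-level `transport.tcp.connect_timeout`, and that the "initial discovery state" warning is governed by `discovery.initial_state_timeout`; `discovery.zen.fd.ping_timeout` applies to fault-detection pings between nodes that are already connected. The 60s values are illustrative, not recommendations. Time values are written as a number followed by a unit, such as `60s` or `1m`.

```yaml
# elasticsearch.yml (Elasticsearch 5.x). Values are illustrative; each defaults to 30s.

# Socket connect timeout for node-to-node transport connections (port 9300).
# Assumed here to be the source of connect_timeout[30s] in ConnectTransportException.
transport.tcp.connect_timeout: 60s

# How long a starting node waits for an initial cluster state
# before it continues starting up without one.
discovery.initial_state_timeout: 60s

# Fault detection: how long to wait for a ping response from a connected node.
discovery.zen.fd.ping_timeout: 60s
```

Raising these only helps if the connection is slow rather than blocked, so the network path to port 9300 on the master is still worth checking.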

Hi David, on node 2 I get the error below when I run the cat health / cat nodes commands:

{
  "error" : {
    "root_cause" : [ {
      "type" : "master_not_discovered_exception",
      "reason" : null
    } ],
    "type" : "master_not_discovered_exception",
    "reason" : null
  },
  "status" : 503
}

node 1 - 
{
"cluster_name" : "proj-Elasticsearch",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 275,
"active_shards" : 541,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 9,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 98.36363636363636
}

cat/nodes
Xx.XX.xx.xx  83 2 0.42 0.61 0.65 mdi * 0poAoxs

Let me know if you need any other data.

Actually, node 1 is the master now. Is there any way we can remove node 2 from the ES cluster and add it back again, so that it will be able to detect its master?

You have only 2 nodes?

If so, that's against best practices, and here you are probably suffering from a split brain.
If everything is replicated, I'd probably delete all the data on node2 and restart it so it can rejoin node1 and form a cluster.
You also need to set the minimum master nodes setting to 2 to prevent another split-brain situation.
And add a 3rd node.

But again, you must upgrade your clusters. Those are not safe and resilient IMO.
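
The minimum master nodes setting mentioned above could look like this in elasticsearch.yml, as a sketch for Zen discovery on 5.x. The usual quorum rule is (master-eligible nodes / 2) + 1 with integer division, which gives 2 for both a two-node and a three-node cluster.

```yaml
# elasticsearch.yml, on every master-eligible node (Zen discovery, 5.x)
# quorum = (master_eligible_nodes / 2) + 1 = 2 for either 2 or 3 nodes
discovery.zen.minimum_master_nodes: 2
```

With only two nodes and a value of 2, the cluster has no master whenever either node is down, which is why the third node matters.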

Yes, in our current model we have only two nodes. We did delete all the data on node 2 and restarted it, but node 2 is still not able to recognize its master. As mentioned, our customer is using an older version. Is there anything else we can do to get out of this issue?

Could you share the full logs of both nodes please?

Please format your code, logs or configuration files using </> icon as explained in this guide and not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

There's a live preview panel for exactly this reason.

Lots of people read these forums, and many of them will simply skip over a post that is difficult to read, because it's just too large an investment of their time to try and follow a wall of badly formatted text.
If your goal is to get an answer to your questions, it's in your interest to make it as easy to read and understand as possible.

Hi David, I don't see an upload button. Is there anywhere I can upload the log file?

Node 1 (master logs)

[2022-02-10T18:03:21,420][INFO ][o.e.p.s.t.SSLNettyTransport] [46VbSuh]  After ch.pipeline().addFirst
[2022-02-10T18:03:21,439][INFO ][o.e.c.s.ClusterService   ] [46VbSuh] new_master {46VbSuh}{46VbSuhJSb2dIoxQWNi77Q}{1d7fXPVfSoWPrBw4mySJ-Q}{iseD-pan-e1.wal-mart.com}{10.24.135.92:9300}, reason: zen-disco-elected-as-master ([0] nodes joined)
[2022-02-10T18:03:21,484][INFO ][o.e.p.s.h.SSLNettyHttpServerTransport] [46VbSuh] [TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA, TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384, TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA, TLS_DHE_RSA_WITH_AES_128_CBC_SHA256, TLS_DHE_RSA_WITH_AES_256_CBC_SHA256, TLS_DHE_RSA_WITH_AES_128_CBC_SHA, TLS_DHE_RSA_WITH_AES_256_CBC_SHA, TLS_RSA_WITH_AES_128_CBC_SHA256, TLS_RSA_WITH_AES_256_CBC_SHA256, TLS_RSA_WITH_AES_128_CBC_SHA, TLS_RSA_WITH_AES_256_CBC_SHA]Enable Cipher
[2022-02-10T18:03:21,484][INFO ][o.e.p.s.h.SSLNettyHttpServerTransport] [46VbSuh] [TLSv1.2]Enable protocols
[2022-02-10T18:33:55,601][WARN ][o.e.p.s.t.SSLNettyTransport] [aGl4jIE] exception caught on transport layer [[id: 0x26f174e0, L:0.0.0.0/0.0.0.0:9300 ! R:/10.226.5.92:50557]], closing connection
io.netty.handler.codec.DecoderException: io.netty.handler.ssl.NotSslRecordException: not an SSL/TLS record: 0d0afff4fffd06
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:459) ~[netty-codec-4.1.29.Final.jar:4.1.29.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265) ~[netty-codec-4.1.29.Final.jar:4.1.29.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.29.Final.jar:4.1.29.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.29.Final.jar:4.1.29.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.29.Final.jar:4.1.29.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434) [netty-transport-4.1.29.Final.jar:4.1.29.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.29.Final.jar:4.1.29.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.29.Final.jar:4.1.29.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965) [netty-transport-4.1.29.Final.jar:4.1.29.Final]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.29.Final.jar:4.1.29.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:628) [netty-transport-4.1.29.Final.jar:4.1.29.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:528) [netty-transport-4.1.29.Final.jar:4.1.29.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:482) [netty-transport-4.1.29.Final.jar:4.1.29.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442) [netty-transport-4.1.29.Final.jar:4.1.29.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884) [netty-common-4.1.29.Final.jar:4.1.29.Final]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_242]
Caused by: io.netty.handler.ssl.NotSslRecordException: not an SSL/TLS record: 0d0afff4fffd06
        at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1178) ~[?:?]
        at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1243) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:489) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:428) ~[?:?]
        ... 15 more
[2022-02-10T18:33:56,372][INFO ][o.e.p.s.t.SSLNettyTransport] [aGl4jIE] Intializing SSL context for ES Transport Client's server  with 1 KeyManagers and 1 TrustManagers

Node 2 logs

[2022-02-10T18:04:17,131][INFO ][o.e.p.s.t.SSLNettyTransport] [dTijf8b]  After new SslHandler(engine)
[2022-02-10T18:04:17,131][INFO ][o.e.p.s.t.SSLNettyTransport] [dTijf8b]  After ch.pipeline().addFirst
[2022-02-10T18:04:19,447][WARN ][o.e.n.Node               ] [dTijf8b] timed out while waiting for initial discovery state - timeout: 30s
[2022-02-10T18:04:19,467][INFO ][o.e.p.s.h.SSLNettyHttpServerTransport] [dTijf8b] [TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA, TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384, TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA, TLS_DHE_RSA_WITH_AES_128_CBC_SHA256, TLS_DHE_RSA_WITH_AES_256_CBC_SHA256, TLS_DHE_RSA_WITH_AES_128_CBC_SHA, TLS_DHE_RSA_WITH_AES_256_CBC_SHA, TLS_RSA_WITH_AES_128_CBC_SHA256, TLS_RSA_WITH_AES_256_CBC_SHA256, TLS_RSA_WITH_AES_128_CBC_SHA, TLS_RSA_WITH_AES_256_CBC_SHA]Enable Cipher
[2022-02-10T18:04:19,467][INFO ][o.e.p.s.h.SSLNettyHttpServerTransport] [dTijf8b] [TLSv1.2]Enable protocols
[2022-02-10T18:04:19,498][INFO ][o.e.p.s.h.SSLNettyHttpServerTransport] [dTijf8b] publish_address {127.0.0.1:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}
[2022-02-10T18:04:19,498][INFO ][o.e.n.Node               ] [dTijf8b] started
[2022-02-10T18:04:19,549][INFO ][o.e.p.s.h.SSLNettyHttpServerTransport] [dTijf8b] Inside  initChannel of SSLHttpChannelHandler
[2022-02-10T18:04:22,188][INFO ][o.e.p.s.t.SSLNettyTransport] [dTijf8b] Intializing SSL context for ES Transport Client's server  with 1 KeyManagers and 1 TrustManagers
[2022-02-10T18:04:22,189][INFO ][o.e.p.s.t.SSLNettyTransport] [dTijf8b]  After serverContext.init()
[2022-02-10T18:04:22,189][INFO ][o.e.p.s.t.SSLNettyTransport] [dTijf8b]  After  serverContext.createSSLEngine()
[2022-02-10T18:04:22,189][INFO ][o.e.p.s.t.SSLNettyTransport] [dTijf8b]  After  new SSLParameters();
[2022-02-10T18:04:22,189][INFO ][o.e.p.s.t.SSLNettyTransport] [dTijf8b]  AftersslParams.setCipherSuites[Ljava.lang.String;@37ec689d
[2022-02-10T18:04:22,189][INFO ][o.e.p.s.t.SSLNettyTransport] [dTijf8b]  AftersslParams.setProtocols[Ljava.lang.String;@773cbe39
[2022-02-10T18:04:22,189][INFO ][o.e.p.s.t.SSLNettyTransport] [dTijf8b]  AftersslParams.setSSLParametersjavax.net.ssl.SSLParameters@14b33f5d
[2022-02-10T18:04:22,189][INFO ][o.e.p.s.t.SSLNettyTransport] [dTijf8b]  After new SslHandler(engine)
[2022-02-10T18:04:22,189][INFO ][o.e.p.s.t.SSLNettyTransport] [dTijf8b]  After ch.pipeline().addFirst
[2022-02-10T18:04:22,614][WARN ][o.e.d.z.ZenDiscovery     ] [dTijf8b] failed to connect to master [{46VbSuh}{46VbSuhJSb2dIoxQWNi77Q}{1d7fXPVfSoWPrBw4mySJ-Q}{iseD-pan-e1.wal-mart.com}{10.24.135.92:9300}], retrying...
org.elasticsearch.transport.ConnectTransportException: [46VbSuh][10.24.135.92:9300] connect_timeout[30s]
        at org.elasticsearch.transport.netty4.Netty4Transport.connectToChannels(Netty4Transport.java:361) ~[?:?]
        at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:548) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:472) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:332) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:319) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.discovery.zen.ZenDiscovery.joinElectedMaster(ZenDiscovery.java:459) [elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.discovery.zen.ZenDiscovery.innerJoinCluster(ZenDiscovery.java:411) [elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.discovery.zen.ZenDiscovery.access$4100(ZenDiscovery.java:83) [elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.discovery.zen.ZenDiscovery$JoinThreadControl$1.run(ZenDiscovery.java:1188) [elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.5.2.jar:5.5.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_242]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_242]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_242]
Caused by: io.netty.channel.ConnectTimeoutException: connection timed out: iseD-pan-e1.wal-mart.com/10.24.135.92:9300
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:267) ~[?:?]
        at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38) ~[?:?]
        at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:127) ~[?:?]
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) ~[?:?]
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446) ~[?:?]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884) ~[?:?]
        ... 1 more
[2022-02-10T18:04:27,256][INFO ][o.e.p.s.t.SSLNettyTransport] [dTijf8b] Intializing SSL context for ES Transport Client's server  with 1 KeyManagers and 1 TrustManagers

Also, I want to reproduce the master-not-discovered exception on node 2 in my own setup. Can you share any tweaks to the master node or to node 2 that would make node 2 show this exception? I tried disabling port 9300 on the master node, but that did not produce the exception.
