ES Master suddenly losing connection to data nodes

Hi All,

I have a 3-node cluster running ES v1.3.6 on AWS EC2 instances. The cluster comprises 1 dedicated master and 2 data nodes and has been running successfully for over 6 months. The data volumes are fairly small, and I have other clusters with the same configuration and much larger data nodes.

The master on this cluster has suddenly started becoming unresponsive to the Java client, and I have to regularly restart the master node. There have been no application updates or configuration changes.

If I look in the log I see the following:

[2015-07-15 08:20:17,596][INFO ][node ] [prod-master-1] initialized
[2015-07-15 08:20:17,596][INFO ][node ] [prod-master-1] starting ...
[2015-07-15 08:20:17,729][INFO ][transport ] [prod-master-1] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.178.245.179:9300]}
[2015-07-15 08:20:17,732][INFO ][discovery ] [prod-master-1] common-event-log-prod/FEEscf5uRnSTHbFvOwE8RQ
[2015-07-15 08:20:23,063][INFO ][cluster.service ] [prod-master-1] new_master [prod-master-1][FEEscf5uRnSTHbFvOwE8RQ][localhost][inet[/10.178.245.179:9300]]{data=false, master=true}, reason: zen-disco-join (elected_as_master)
[2015-07-15 08:20:23,144][INFO ][http ] [prod-master-1] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.178.245.179:9200]}
[2015-07-15 08:20:23,145][INFO ][node ] [prod-master-1] started
[2015-07-15 08:20:23,469][INFO ][gateway ] [prod-master-1] recovered [111] indices into cluster_state
[2015-07-15 08:20:24,554][TRACE][transport.netty ] [prod-master-1] channel closed: [id: 0x8aa3e3cb, /10.178.245.172:59415 => /10.178.245.179:9300]
[2015-07-15 08:20:24,554][TRACE][transport.netty ] [prod-master-1] channel closed: [id: 0xe4ed85fe, /10.178.245.172:59413 => /10.178.245.179:9300]
[2015-07-15 08:20:24,557][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xbe4b79f5, /10.178.245.172:59417 => /10.178.245.179:9300]
[2015-07-15 08:20:24,558][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xde9f534e, /10.178.245.172:59418 => /10.178.245.179:9300]
[2015-07-15 08:20:24,558][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x6ab12984, /10.178.245.172:59419 => /10.178.245.179:9300]
[2015-07-15 08:20:24,558][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x890d88bf, /10.178.245.172:59420 => /10.178.245.179:9300]
[2015-07-15 08:20:24,558][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xa0502f77, /10.178.245.172:59421 => /10.178.245.179:9300]
[2015-07-15 08:20:24,559][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x96b0c4a1, /10.178.245.172:59422 => /10.178.245.179:9300]
[2015-07-15 08:20:24,559][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x722347e7, /10.178.245.172:59423 => /10.178.245.179:9300]
[2015-07-15 08:20:24,559][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x4cc89e3e, /10.178.245.172:59424 => /10.178.245.179:9300]
[2015-07-15 08:20:24,559][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xe1d87bb3, /10.178.245.172:59425 => /10.178.245.179:9300]
[2015-07-15 08:20:24,559][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x929214ca, /10.178.245.172:59426 => /10.178.245.179:9300]
[2015-07-15 08:20:24,560][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x40319a1f, /10.178.245.172:59427 => /10.178.245.179:9300]
[2015-07-15 08:20:24,560][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xd7d4a870, /10.178.245.172:59428 => /10.178.245.179:9300]
[2015-07-15 08:20:24,560][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x80322de3, /10.178.245.172:59429 => /10.178.245.179:9300]
[2015-07-15 08:20:24,564][DEBUG][transport.netty ] [prod-master-1] connected to node [[prod-data-2][fPZd9JC3TZSxPAXNa4GNRQ][localhost][inet[/10.178.245.172:9300]]{master=false}]
[2015-07-15 08:20:24,589][INFO ][cluster.service ] [prod-master-1] added {[prod-data-2][fPZd9JC3TZSxPAXNa4GNRQ][localhost][inet[/10.178.245.172:9300]]{master=false},}, reason: zen-disco-receive(join from node[[prod-data-2][fPZd9JC3TZSxPAXNa4GNRQ][localhost][inet[/10.178.245.172:9300]]{master=false}])
[2015-07-15 08:20:24,773][TRACE][transport.netty ] [prod-master-1] channel closed: [id: 0x8bc37f3b, /10.178.245.137:39451 => /10.178.245.179:9300]
[2015-07-15 08:20:24,774][TRACE][transport.netty ] [prod-master-1] channel closed: [id: 0xe0b694c2, /10.178.245.137:39450 => /10.178.245.179:9300]
[2015-07-15 08:20:24,775][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x453e5ed8, /10.178.245.137:39453 => /10.178.245.179:9300]
[2015-07-15 08:20:24,775][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xacfd29bc, /10.178.245.137:39454 => /10.178.245.179:9300]
[2015-07-15 08:20:24,776][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x4585c59d, /10.178.245.137:39455 => /10.178.245.179:9300]
[2015-07-15 08:20:24,776][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xc857659b, /10.178.245.137:39456 => /10.178.245.179:9300]
[2015-07-15 08:20:24,776][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x1dfd6ea1, /10.178.245.137:39457 => /10.178.245.179:9300]
[2015-07-15 08:20:24,777][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xb1919fcd, /10.178.245.137:39458 => /10.178.245.179:9300]
[2015-07-15 08:20:24,777][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x12589d2b, /10.178.245.137:39459 => /10.178.245.179:9300]
[2015-07-15 08:20:24,777][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xc88d576a, /10.178.245.137:39460 => /10.178.245.179:9300]
[2015-07-15 08:20:24,777][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xb7802480, /10.178.245.137:39461 => /10.178.245.179:9300]
[2015-07-15 08:20:24,777][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xafbb4bd9, /10.178.245.137:39462 => /10.178.245.179:9300]
[2015-07-15 08:20:24,778][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xf8e79d29, /10.178.245.137:39463 => /10.178.245.179:9300]
[2015-07-15 08:20:24,779][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x91b740c5, /10.178.245.137:39464 => /10.178.245.179:9300]
[2015-07-15 08:20:24,779][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xa6fc5d6e, /10.178.245.137:39465 => /10.178.245.179:9300]
[2015-07-15 08:20:24,780][DEBUG][transport.netty ] [prod-master-1] connected to node [[prod-data-1][eCF7PNqqQ7ect0baiHKlVA][localhost][inet[/10.178.245.137:9300]]{master=false}]
[2015-07-15 08:20:24,816][INFO ][cluster.service ] [prod-master-1] added {[prod-data-1][eCF7PNqqQ7ect0baiHKlVA][localhost][inet[/10.178.245.137:9300]]{master=false},}, reason: zen-disco-receive(join from node[[prod-data-1][eCF7PNqqQ7ect0baiHKlVA][localhost][inet[/10.178.245.137:9300]]{master=false}])
[2015-07-15 08:37:08,075][DEBUG][action.admin.cluster.node.stats] [prod-master-1] failed to execute on node [fPZd9JC3TZSxPAXNa4GNRQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [prod-data-2][inet[/10.178.245.172:9300]][cluster/nodes/stats/n] request_id [40898] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-07-15 08:38:38,075][DEBUG][action.admin.cluster.node.stats] [prod-master-1] failed to execute on node [fPZd9JC3TZSxPAXNa4GNRQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [prod-data-2][inet[/10.178.245.172:9300]][cluster/nodes/stats/n] request_id [44414] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-07-15 08:40:08,076][DEBUG][action.admin.cluster.node.stats] [prod-master-1] failed to execute on node [fPZd9JC3TZSxPAXNa4GNRQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [prod-data-2][inet[/10.178.245.172:9300]][cluster/nodes/stats/n] request_id [47930] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-07-15 08:40:38,076][DEBUG][action.admin.cluster.node.stats] [prod-master-1] failed to execute on node [eCF7PNqqQ7ect0baiHKlVA]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [prod-data-1][inet[/10.178.245.137:9300]][cluster/nodes/stats/n] request_id [49101] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-07-15 08:40:54,185][TRACE][transport.netty ] [prod-master-1] close connection exception caught on transport layer [[id: 0xe80a6999, /10.178.245.179:45904 => /10.178.245.137:9300]], disconnecting from relevant node
java.io.IOException: Connection timed out
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-07-15 08:40:54,185][TRACE][transport.netty ] [prod-master-1] close connection exception caught on transport layer [[id: 0x9e958f54, /10.178.245.179:45900 => /10.178.245.137:9300]], disconnecting from relevant node
java.io.IOException: Connection timed out
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-07-15 08:40:54,188][DEBUG][transport.netty ] [prod-master-1] disconnected from [[prod-data-1][eCF7PNqqQ7ect0baiHKlVA][localhost][inet[/10.178.245.137:9300]]{master=false}], channel closed event
[2015-07-15 08:40:54,190][DEBUG][action.admin.indices.stats] [prod-master-1] [common-event-log-2015-04-28][0], node[eCF7PNqqQ7ect0baiHKlVA], [R], s[STARTED]: failed to executed [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@7b66d238]
org.elasticsearch.transport.NodeDisconnectedException: [prod-data-1][inet[/10.178.245.137:9300]][indices/stats/s] disconnected
[2015-07-15 08:40:54,190][DEBUG][action.admin.indices.stats] [prod-master-1] [common-event-log-2015-06-12][0], node[eCF7PNqqQ7ect0baiHKlVA], [R], s[STARTED]: failed to executed [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@1aea3f91]
org.elasticsearch.transport.NodeDisconnectedException: [prod-data-1][inet[/10.178.245.137:9300]][indices/stats/s] disconnected
[2015-07-15 08:40:54,190][DEBUG][action.admin.indices.stats] [prod-master-1] [common-event-log-2015-04-16][0], node[eCF7PNqqQ7ect0baiHKlVA], [R], s[STARTED]: failed to executed [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@1fad057a]

It appears that the master starts having trouble connecting to both data nodes after a short period. A restart allows new documents to be indexed briefly, but then the whole thing locks up again and the client times out.
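In case it helps anyone answering: these are the kinds of read-only checks against the master (its published HTTP address from the log above) that should show whether the data nodes still answer it, and whether heap, GC, or thread-pool queues are to blame. This is just a sketch using the standard 1.x endpoints:

# Cluster health and any master tasks stuck in the queue
curl -s 'http://10.178.245.179:9200/_cluster/health?pretty'
curl -s 'http://10.178.245.179:9200/_cluster/pending_tasks?pretty'

# Thread pool queues and rejections per node - a saturated queue on a data
# node can stop it answering the master's transport requests in time
curl -s 'http://10.178.245.179:9200/_cat/thread_pool?v'

# Heap, GC and process stats per node - long GC pauses are a common cause
# of the 15s node-stats timeouts shown in the log
curl -s 'http://10.178.245.179:9200/_nodes/stats/jvm,process?pretty'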

Any pointers as to why this is happening, or what I should be looking at to resolve it, would be appreciated.

Thanks in advance.

Kim

Did you find a solution? I have the exact same issue.

I never did get to the root cause, although I would check disk space on the cluster first. We had some free space, so I didn't believe that was the issue. In the end the solution was to take a backup of the data, re-provision the cluster with larger disks (which we needed to do anyway), and then restore the data. That fixed the issue.
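For anyone going the same route, the snapshot and restore API in 1.x is one way to do the backup and restore steps. A minimal sketch, assuming a shared-filesystem repository; the repository name and location here are made up, and on AWS an S3 repository via the cloud-aws plugin is the usual alternative:

# Register a shared-filesystem repository (the location must be reachable
# from every node, e.g. an NFS mount)
curl -XPUT 'http://localhost:9200/_snapshot/my_backup' -d '{
  "type": "fs",
  "settings": { "location": "/mnt/es-backups", "compress": true }
}'

# Snapshot all indices and wait for completion
curl -XPUT 'http://localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true'

# On the re-provisioned cluster, register the same repository again and restore
curl -XPOST 'http://localhost:9200/_snapshot/my_backup/snapshot_1/_restore'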

Cheers,
Kim

Hm.

You can check out my new post.

I think the cause is either the keepalive time of the TCP connections or a kernel bug.
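If it is keepalive, the relevant knobs on the EC2 instances are the kernel's TCP keepalive timers; by default a dead peer isn't probed for two hours. A sketch of checking and lowering them (the values are only examples, not a recommendation):

# Show current values (defaults are usually 7200s / 75s / 9 probes)
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes

# Lower them so half-open connections are noticed within a few minutes
sudo sysctl -w net.ipv4.tcp_keepalive_time=120
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=30
sudo sysctl -w net.ipv4.tcp_keepalive_probes=5

# Make the change survive a reboot
echo 'net.ipv4.tcp_keepalive_time = 120' | sudo tee -a /etc/sysctl.conf

Elasticsearch enables SO_KEEPALIVE on its transport sockets by default (network.tcp.keep_alive), so these kernel timers are what decide how quickly a silently dropped connection actually gets noticed.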