ES Master suddenly losing connection to data nodes

Hi All,

I have a 3-node cluster running ES v1.3.6 on AWS EC2 instances. The cluster comprises 1 dedicated master and 2 data nodes and has been running successfully for over 6 months. The data volumes are fairly small, and I have other clusters with the same configuration and much larger data nodes.

The master on this cluster has suddenly started becoming unresponsive to the Java client, and I have to regularly restart the master node. There have been no application updates or configuration changes.

If I look in the log I see the following:

[2015-07-15 08:20:17,596][INFO ][node ] [prod-master-1] initialized
[2015-07-15 08:20:17,596][INFO ][node ] [prod-master-1] starting ...
[2015-07-15 08:20:17,729][INFO ][transport ] [prod-master-1] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.178.245.179:9300]}
[2015-07-15 08:20:17,732][INFO ][discovery ] [prod-master-1] common-event-log-prod/FEEscf5uRnSTHbFvOwE8RQ
[2015-07-15 08:20:23,063][INFO ][cluster.service ] [prod-master-1] new_master [prod-master-1][FEEscf5uRnSTHbFvOwE8RQ][localhost][inet[/10.178.245.179:9300]]{data=false, master=true}, reason: zen-disco-join (elected_as_master)
[2015-07-15 08:20:23,144][INFO ][http ] [prod-master-1] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.178.245.179:9200]}
[2015-07-15 08:20:23,145][INFO ][node ] [prod-master-1] started
[2015-07-15 08:20:23,469][INFO ][gateway ] [prod-master-1] recovered [111] indices into cluster_state
[2015-07-15 08:20:24,554][TRACE][transport.netty ] [prod-master-1] channel closed: [id: 0x8aa3e3cb, /10.178.245.172:59415 => /10.178.245.179:9300]
[2015-07-15 08:20:24,554][TRACE][transport.netty ] [prod-master-1] channel closed: [id: 0xe4ed85fe, /10.178.245.172:59413 => /10.178.245.179:9300]
[2015-07-15 08:20:24,557][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xbe4b79f5, /10.178.245.172:59417 => /10.178.245.179:9300]
[2015-07-15 08:20:24,558][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xde9f534e, /10.178.245.172:59418 => /10.178.245.179:9300]
[2015-07-15 08:20:24,558][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x6ab12984, /10.178.245.172:59419 => /10.178.245.179:9300]
[2015-07-15 08:20:24,558][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x890d88bf, /10.178.245.172:59420 => /10.178.245.179:9300]
[2015-07-15 08:20:24,558][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xa0502f77, /10.178.245.172:59421 => /10.178.245.179:9300]
[2015-07-15 08:20:24,559][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x96b0c4a1, /10.178.245.172:59422 => /10.178.245.179:9300]
[2015-07-15 08:20:24,559][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x722347e7, /10.178.245.172:59423 => /10.178.245.179:9300]
[2015-07-15 08:20:24,559][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x4cc89e3e, /10.178.245.172:59424 => /10.178.245.179:9300]
[2015-07-15 08:20:24,559][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xe1d87bb3, /10.178.245.172:59425 => /10.178.245.179:9300]
[2015-07-15 08:20:24,559][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x929214ca, /10.178.245.172:59426 => /10.178.245.179:9300]
[2015-07-15 08:20:24,560][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x40319a1f, /10.178.245.172:59427 => /10.178.245.179:9300]
[2015-07-15 08:20:24,560][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xd7d4a870, /10.178.245.172:59428 => /10.178.245.179:9300]
[2015-07-15 08:20:24,560][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x80322de3, /10.178.245.172:59429 => /10.178.245.179:9300]
[2015-07-15 08:20:24,564][DEBUG][transport.netty ] [prod-master-1] connected to node [[prod-data-2][fPZd9JC3TZSxPAXNa4GNRQ][localhost][inet[/10.178.245.172:9300]]{master=false}]
[2015-07-15 08:20:24,589][INFO ][cluster.service ] [prod-master-1] added {[prod-data-2][fPZd9JC3TZSxPAXNa4GNRQ][localhost][inet[/10.178.245.172:9300]]{master=false},}, reason: zen-disco-receive(join from node[[prod-data-2][fPZd9JC3TZSxPAXNa4GNRQ][localhost][inet[/10.178.245.172:9300]]{master=false}])
[2015-07-15 08:20:24,773][TRACE][transport.netty ] [prod-master-1] channel closed: [id: 0x8bc37f3b, /10.178.245.137:39451 => /10.178.245.179:9300]
[2015-07-15 08:20:24,774][TRACE][transport.netty ] [prod-master-1] channel closed: [id: 0xe0b694c2, /10.178.245.137:39450 => /10.178.245.179:9300]
[2015-07-15 08:20:24,775][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x453e5ed8, /10.178.245.137:39453 => /10.178.245.179:9300]
[2015-07-15 08:20:24,775][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xacfd29bc, /10.178.245.137:39454 => /10.178.245.179:9300]
[2015-07-15 08:20:24,776][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x4585c59d, /10.178.245.137:39455 => /10.178.245.179:9300]
[2015-07-15 08:20:24,776][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xc857659b, /10.178.245.137:39456 => /10.178.245.179:9300]
[2015-07-15 08:20:24,776][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x1dfd6ea1, /10.178.245.137:39457 => /10.178.245.179:9300]
[2015-07-15 08:20:24,777][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xb1919fcd, /10.178.245.137:39458 => /10.178.245.179:9300]
[2015-07-15 08:20:24,777][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x12589d2b, /10.178.245.137:39459 => /10.178.245.179:9300]
[2015-07-15 08:20:24,777][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xc88d576a, /10.178.245.137:39460 => /10.178.245.179:9300]
[2015-07-15 08:20:24,777][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xb7802480, /10.178.245.137:39461 => /10.178.245.179:9300]
[2015-07-15 08:20:24,777][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xafbb4bd9, /10.178.245.137:39462 => /10.178.245.179:9300]
[2015-07-15 08:20:24,778][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xf8e79d29, /10.178.245.137:39463 => /10.178.245.179:9300]
[2015-07-15 08:20:24,779][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x91b740c5, /10.178.245.137:39464 => /10.178.245.179:9300]
[2015-07-15 08:20:24,779][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xa6fc5d6e, /10.178.245.137:39465 => /10.178.245.179:9300]
[2015-07-15 08:20:24,780][DEBUG][transport.netty ] [prod-master-1] connected to node [[prod-data-1][eCF7PNqqQ7ect0baiHKlVA][localhost][inet[/10.178.245.137:9300]]{master=false}]
[2015-07-15 08:20:24,816][INFO ][cluster.service ] [prod-master-1] added {[prod-data-1][eCF7PNqqQ7ect0baiHKlVA][localhost][inet[/10.178.245.137:9300]]{master=false},}, reason: zen-disco-receive(join from node[[prod-data-1][eCF7PNqqQ7ect0baiHKlVA][localhost][inet[/10.178.245.137:9300]]{master=false}])
[2015-07-15 08:37:08,075][DEBUG][action.admin.cluster.node.stats] [prod-master-1] failed to execute on node [fPZd9JC3TZSxPAXNa4GNRQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [prod-data-2][inet[/10.178.245.172:9300]][cluster/nodes/stats/n] request_id [40898] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-07-15 08:38:38,075][DEBUG][action.admin.cluster.node.stats] [prod-master-1] failed to execute on node [fPZd9JC3TZSxPAXNa4GNRQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [prod-data-2][inet[/10.178.245.172:9300]][cluster/nodes/stats/n] request_id [44414] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-07-15 08:40:08,076][DEBUG][action.admin.cluster.node.stats] [prod-master-1] failed to execute on node [fPZd9JC3TZSxPAXNa4GNRQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [prod-data-2][inet[/10.178.245.172:9300]][cluster/nodes/stats/n] request_id [47930] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-07-15 08:40:38,076][DEBUG][action.admin.cluster.node.stats] [prod-master-1] failed to execute on node [eCF7PNqqQ7ect0baiHKlVA]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [prod-data-1][inet[/10.178.245.137:9300]][cluster/nodes/stats/n] request_id [49101] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-07-15 08:40:54,185][TRACE][transport.netty ] [prod-master-1] close connection exception caught on transport layer [[id: 0xe80a6999, /10.178.245.179:45904 => /10.178.245.137:9300]], disconnecting from relevant node
java.io.IOException: Connection timed out
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-07-15 08:40:54,185][TRACE][transport.netty ] [prod-master-1] close connection exception caught on transport layer [[id: 0x9e958f54, /10.178.245.179:45900 => /10.178.245.137:9300]], disconnecting from relevant node
java.io.IOException: Connection timed out
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-07-15 08:40:54,188][DEBUG][transport.netty ] [prod-master-1] disconnected from [[prod-data-1][eCF7PNqqQ7ect0baiHKlVA][localhost][inet[/10.178.245.137:9300]]{master=false}], channel closed event
[2015-07-15 08:40:54,190][DEBUG][action.admin.indices.stats] [prod-master-1] [common-event-log-2015-04-28][0], node[eCF7PNqqQ7ect0baiHKlVA], [R], s[STARTED]: failed to executed [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@7b66d238]
org.elasticsearch.transport.NodeDisconnectedException: [prod-data-1][inet[/10.178.245.137:9300]][indices/stats/s] disconnected
[2015-07-15 08:40:54,190][DEBUG][action.admin.indices.stats] [prod-master-1] [common-event-log-2015-06-12][0], node[eCF7PNqqQ7ect0baiHKlVA], [R], s[STARTED]: failed to executed [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@1aea3f91]
org.elasticsearch.transport.NodeDisconnectedException: [prod-data-1][inet[/10.178.245.137:9300]][indices/stats/s] disconnected
[2015-07-15 08:40:54,190][DEBUG][action.admin.indices.stats] [prod-master-1] [common-event-log-2015-04-16][0], node[eCF7PNqqQ7ect0baiHKlVA], [R], s[STARTED]: failed to executed [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@1fad057a]

It appears that the master starts having trouble connecting to both data nodes after a short period. A restart allows new documents to be indexed briefly, but then the whole thing locks up again and the client times out.
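In case it helps anyone answering: these are the kinds of read-only checks against the master (its published HTTP address from the log above) that should show whether the data nodes still answer it, and whether heap, GC, or thread-pool queues are to blame. This is just a sketch using the standard 1.x endpoints:

# Cluster health and any master tasks stuck in the queue
curl -s 'http://10.178.245.179:9200/_cluster/health?pretty'
curl -s 'http://10.178.245.179:9200/_cluster/pending_tasks?pretty'

# Thread pool queues and rejections per node - a saturated queue on a data
# node can stop it answering the master's transport requests in time
curl -s 'http://10.178.245.179:9200/_cat/thread_pool?v'

# Heap, GC and process stats per node - long GC pauses are a common cause
# of the 15s node-stats timeouts shown in the log
curl -s 'http://10.178.245.179:9200/_nodes/stats/jvm,process?pretty'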

Any pointers as to why this is happening, or what I should be looking at to resolve it, would be appreciated.

Thanks in advance.

Kim

Did you find a solution? I have the exact same issue.

I never did get to the root cause, although I would check disk space on the cluster first. We had some free space, so I didn't believe that was the issue. In the end the solution was to take a backup of the data, re-provision the cluster with larger disks (which we needed to do anyway), and then restore the data. That fixed the issue.
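For anyone going the same route, the snapshot and restore API in 1.x is one way to do the backup and restore steps. A minimal sketch, assuming a shared-filesystem repository; the repository name and location here are made up, and on AWS an S3 repository via the cloud-aws plugin is the usual alternative:

# Register a shared-filesystem repository (the location must be reachable
# from every node, e.g. an NFS mount)
curl -XPUT 'http://localhost:9200/_snapshot/my_backup' -d '{
  "type": "fs",
  "settings": { "location": "/mnt/es-backups", "compress": true }
}'

# Snapshot all indices and wait for completion
curl -XPUT 'http://localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true'

# On the re-provisioned cluster, register the same repository again and restore
curl -XPOST 'http://localhost:9200/_snapshot/my_backup/snapshot_1/_restore'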

Cheers,
Kim

Hm.

You can check out my new post.

I think the cause is either the keepalive time of the TCP connections or a kernel bug.
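If it is keepalive, the relevant knobs on the EC2 instances are the kernel's TCP keepalive timers; by default a dead peer isn't probed for two hours. A sketch of checking and lowering them (the values are only examples, not a recommendation):

# Show current values (defaults are usually 7200s / 75s / 9 probes)
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes

# Lower them so half-open connections are noticed within a few minutes
sudo sysctl -w net.ipv4.tcp_keepalive_time=120
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=30
sudo sysctl -w net.ipv4.tcp_keepalive_probes=5

# Make the change survive a reboot
echo 'net.ipv4.tcp_keepalive_time = 120' | sudo tee -a /etc/sysctl.conf

Elasticsearch enables SO_KEEPALIVE on its transport sockets by default (network.tcp.keep_alive), so these kernel timers are what decide how quickly a silently dropped connection actually gets noticed.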