Hi All,
I have a 3-node cluster running ES v1.3.6 on AWS EC2 instances. The cluster comprises 1 master and 2 data nodes and has been running successfully for over 6 months. The data volumes are fairly small, and I have other clusters with the same configuration and much larger data nodes.
The master on this cluster has suddenly started becoming unresponsive to the Java client, and I have to regularly restart the master node. There have been no application updates or configuration changes.
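For reference, the client connects with the standard ES 1.x TransportClient, roughly like this (a simplified sketch; the cluster name and master address are the real ones from this cluster, everything else is illustrative):

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class EsClientSketch {
    public static Client buildClient() {
        // Cluster name matches the one in the discovery log line below.
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "common-event-log-prod")
                .build();
        // Points at the master's transport port (9300), as in our setup.
        return new TransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress("10.178.245.179", 9300));
    }
}
```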
If I look in the log I see the following:
[2015-07-15 08:20:17,596][INFO ][node ] [prod-master-1] initialized
[2015-07-15 08:20:17,596][INFO ][node ] [prod-master-1] starting ...
[2015-07-15 08:20:17,729][INFO ][transport ] [prod-master-1] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.178.245.179:9300]}
[2015-07-15 08:20:17,732][INFO ][discovery ] [prod-master-1] common-event-log-prod/FEEscf5uRnSTHbFvOwE8RQ
[2015-07-15 08:20:23,063][INFO ][cluster.service ] [prod-master-1] new_master [prod-master-1][FEEscf5uRnSTHbFvOwE8RQ][localhost][inet[/10.178.245.179:9300]]{data=false, master=true}, reason: zen-disco-join (elected_as_master)
[2015-07-15 08:20:23,144][INFO ][http ] [prod-master-1] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.178.245.179:9200]}
[2015-07-15 08:20:23,145][INFO ][node ] [prod-master-1] started
[2015-07-15 08:20:23,469][INFO ][gateway ] [prod-master-1] recovered [111] indices into cluster_state
[2015-07-15 08:20:24,554][TRACE][transport.netty ] [prod-master-1] channel closed: [id: 0x8aa3e3cb, /10.178.245.172:59415 => /10.178.245.179:9300]
[2015-07-15 08:20:24,554][TRACE][transport.netty ] [prod-master-1] channel closed: [id: 0xe4ed85fe, /10.178.245.172:59413 => /10.178.245.179:9300]
[2015-07-15 08:20:24,557][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xbe4b79f5, /10.178.245.172:59417 => /10.178.245.179:9300]
[2015-07-15 08:20:24,558][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xde9f534e, /10.178.245.172:59418 => /10.178.245.179:9300]
[2015-07-15 08:20:24,558][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x6ab12984, /10.178.245.172:59419 => /10.178.245.179:9300]
[2015-07-15 08:20:24,558][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x890d88bf, /10.178.245.172:59420 => /10.178.245.179:9300]
[2015-07-15 08:20:24,558][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xa0502f77, /10.178.245.172:59421 => /10.178.245.179:9300]
[2015-07-15 08:20:24,559][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x96b0c4a1, /10.178.245.172:59422 => /10.178.245.179:9300]
[2015-07-15 08:20:24,559][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x722347e7, /10.178.245.172:59423 => /10.178.245.179:9300]
[2015-07-15 08:20:24,559][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x4cc89e3e, /10.178.245.172:59424 => /10.178.245.179:9300]
[2015-07-15 08:20:24,559][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xe1d87bb3, /10.178.245.172:59425 => /10.178.245.179:9300]
[2015-07-15 08:20:24,559][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x929214ca, /10.178.245.172:59426 => /10.178.245.179:9300]
[2015-07-15 08:20:24,560][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x40319a1f, /10.178.245.172:59427 => /10.178.245.179:9300]
[2015-07-15 08:20:24,560][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xd7d4a870, /10.178.245.172:59428 => /10.178.245.179:9300]
[2015-07-15 08:20:24,560][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x80322de3, /10.178.245.172:59429 => /10.178.245.179:9300]
[2015-07-15 08:20:24,564][DEBUG][transport.netty ] [prod-master-1] connected to node [[prod-data-2][fPZd9JC3TZSxPAXNa4GNRQ][localhost][inet[/10.178.245.172:9300]]{master=false}]
[2015-07-15 08:20:24,589][INFO ][cluster.service ] [prod-master-1] added {[prod-data-2][fPZd9JC3TZSxPAXNa4GNRQ][localhost][inet[/10.178.245.172:9300]]{master=false},}, reason: zen-disco-receive(join from node[[prod-data-2][fPZd9JC3TZSxPAXNa4GNRQ][localhost][inet[/10.178.245.172:9300]]{master=false}])
[2015-07-15 08:20:24,773][TRACE][transport.netty ] [prod-master-1] channel closed: [id: 0x8bc37f3b, /10.178.245.137:39451 => /10.178.245.179:9300]
[2015-07-15 08:20:24,774][TRACE][transport.netty ] [prod-master-1] channel closed: [id: 0xe0b694c2, /10.178.245.137:39450 => /10.178.245.179:9300]
[2015-07-15 08:20:24,775][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x453e5ed8, /10.178.245.137:39453 => /10.178.245.179:9300]
[2015-07-15 08:20:24,775][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xacfd29bc, /10.178.245.137:39454 => /10.178.245.179:9300]
[2015-07-15 08:20:24,776][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x4585c59d, /10.178.245.137:39455 => /10.178.245.179:9300]
[2015-07-15 08:20:24,776][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xc857659b, /10.178.245.137:39456 => /10.178.245.179:9300]
[2015-07-15 08:20:24,776][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x1dfd6ea1, /10.178.245.137:39457 => /10.178.245.179:9300]
[2015-07-15 08:20:24,777][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xb1919fcd, /10.178.245.137:39458 => /10.178.245.179:9300]
[2015-07-15 08:20:24,777][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x12589d2b, /10.178.245.137:39459 => /10.178.245.179:9300]
[2015-07-15 08:20:24,777][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xc88d576a, /10.178.245.137:39460 => /10.178.245.179:9300]
[2015-07-15 08:20:24,777][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xb7802480, /10.178.245.137:39461 => /10.178.245.179:9300]
[2015-07-15 08:20:24,777][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xafbb4bd9, /10.178.245.137:39462 => /10.178.245.179:9300]
[2015-07-15 08:20:24,778][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xf8e79d29, /10.178.245.137:39463 => /10.178.245.179:9300]
[2015-07-15 08:20:24,779][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0x91b740c5, /10.178.245.137:39464 => /10.178.245.179:9300]
[2015-07-15 08:20:24,779][TRACE][transport.netty ] [prod-master-1] channel opened: [id: 0xa6fc5d6e, /10.178.245.137:39465 => /10.178.245.179:9300]
[2015-07-15 08:20:24,780][DEBUG][transport.netty ] [prod-master-1] connected to node [[prod-data-1][eCF7PNqqQ7ect0baiHKlVA][localhost][inet[/10.178.245.137:9300]]{master=false}]
[2015-07-15 08:20:24,816][INFO ][cluster.service ] [prod-master-1] added {[prod-data-1][eCF7PNqqQ7ect0baiHKlVA][localhost][inet[/10.178.245.137:9300]]{master=false},}, reason: zen-disco-receive(join from node[[prod-data-1][eCF7PNqqQ7ect0baiHKlVA][localhost][inet[/10.178.245.137:9300]]{master=false}])
[2015-07-15 08:37:08,075][DEBUG][action.admin.cluster.node.stats] [prod-master-1] failed to execute on node [fPZd9JC3TZSxPAXNa4GNRQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [prod-data-2][inet[/10.178.245.172:9300]][cluster/nodes/stats/n] request_id [40898] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-07-15 08:38:38,075][DEBUG][action.admin.cluster.node.stats] [prod-master-1] failed to execute on node [fPZd9JC3TZSxPAXNa4GNRQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [prod-data-2][inet[/10.178.245.172:9300]][cluster/nodes/stats/n] request_id [44414] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-07-15 08:40:08,076][DEBUG][action.admin.cluster.node.stats] [prod-master-1] failed to execute on node [fPZd9JC3TZSxPAXNa4GNRQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [prod-data-2][inet[/10.178.245.172:9300]][cluster/nodes/stats/n] request_id [47930] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-07-15 08:40:38,076][DEBUG][action.admin.cluster.node.stats] [prod-master-1] failed to execute on node [eCF7PNqqQ7ect0baiHKlVA]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [prod-data-1][inet[/10.178.245.137:9300]][cluster/nodes/stats/n] request_id [49101] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-07-15 08:40:54,185][TRACE][transport.netty ] [prod-master-1] close connection exception caught on transport layer [[id: 0xe80a6999, /10.178.245.179:45904 => /10.178.245.137:9300]], disconnecting from relevant node
java.io.IOException: Connection timed out
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-07-15 08:40:54,185][TRACE][transport.netty ] [prod-master-1] close connection exception caught on transport layer [[id: 0x9e958f54, /10.178.245.179:45900 => /10.178.245.137:9300]], disconnecting from relevant node
java.io.IOException: Connection timed out
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-07-15 08:40:54,188][DEBUG][transport.netty ] [prod-master-1] disconnected from [[prod-data-1][eCF7PNqqQ7ect0baiHKlVA][localhost][inet[/10.178.245.137:9300]]{master=false}], channel closed event
[2015-07-15 08:40:54,190][DEBUG][action.admin.indices.stats] [prod-master-1] [common-event-log-2015-04-28][0], node[eCF7PNqqQ7ect0baiHKlVA], [R], s[STARTED]: failed to executed [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@7b66d238]
org.elasticsearch.transport.NodeDisconnectedException: [prod-data-1][inet[/10.178.245.137:9300]][indices/stats/s] disconnected
[2015-07-15 08:40:54,190][DEBUG][action.admin.indices.stats] [prod-master-1] [common-event-log-2015-06-12][0], node[eCF7PNqqQ7ect0baiHKlVA], [R], s[STARTED]: failed to executed [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@1aea3f91]
org.elasticsearch.transport.NodeDisconnectedException: [prod-data-1][inet[/10.178.245.137:9300]][indices/stats/s] disconnected
[2015-07-15 08:40:54,190][DEBUG][action.admin.indices.stats] [prod-master-1] [common-event-log-2015-04-16][0], node[eCF7PNqqQ7ect0baiHKlVA], [R], s[STARTED]: failed to executed [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@1fad057a]
It appears that the master loses its connections to both data nodes after a short period. A restart allows new documents to be indexed briefly, but then the whole thing locks up again and the client times out.
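When it is in this state, even a basic cluster health call from the client hangs until it times out. Roughly (ES 1.x Java API; the 30s timeout is just an example value):

```java
import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;

public class HealthCheckSketch {
    public static void check(Client client) {
        // Blocks until the master answers; when the cluster is wedged this
        // never completes and the actionGet timeout fires instead.
        ClusterHealthResponse health = client.admin().cluster()
                .prepareHealth()
                .execute()
                .actionGet(TimeValue.timeValueSeconds(30)); // example timeout
        System.out.println("cluster status: " + health.getStatus());
    }
}
```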
Any pointers as to why this is happening, or what I should be looking at to resolve it, would be appreciated.
Thanks in advance.
Kim