I have the following Elasticsearch setup:

- an Elasticsearch master node on `coruscant` (which also runs Logstash). In the past this node was a data node, but not anymore, since we have ...
- two Elasticsearch data nodes, on `jangofett` and `bobafett`
- the three hosts (`coruscant`, `jangofett`, `bobafett`) can all reach each other on port 9300
I often experience timeouts when querying the cluster, as well as timeouts between the nodes themselves. The cluster status is also red.
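For reference, this is roughly how I inspect the cluster from the master (assuming the HTTP API is reachable on port 9200 on `coruscant`; the JSON below is an illustrative sample, not actual output from my cluster):

```shell
# Diagnostic queries against the master's HTTP API (port 9200 assumed):
#   curl -s 'http://coruscant:9200/_cluster/health?pretty'
#   curl -s 'http://coruscant:9200/_cat/nodes?v'
#   curl -s 'http://coruscant:9200/_nodes/stats/jvm?pretty'
# A red cluster looks like this in the health response (sample values only):
sample='{"cluster_name":"elasticsearch","status":"red","number_of_nodes":3,"unassigned_shards":12}'
# pull out the status field
echo "$sample" | grep -o '"status":"[a-z]*"'
```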
Here are relevant pieces of information.
The master node times out when querying the data nodes:
[2015-09-11 10:45:27,738][DEBUG][action.admin.cluster.node.stats] [coruscant] failed to execute on node [f29XN7asR6WL0aHIwWdjtw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [bobafett][inet[/10.0.32.125:9300]][cluster:monitor/nodes/stats[n]] request_id [5357769] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-09-11 10:45:39,586][WARN ][gateway.local] [coruscant] [preprod-2015.06.30][4]: failed to list shard stores on node [f29XN7asR6WL0aHIwWdjtw]
org.elasticsearch.action.FailedNodeException: Failed node [f29XN7asR6WL0aHIwWdjtw]
at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.onFailure(TransportNodesOperationAction.java:206)
at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access$1000(TransportNodesOperationAction.java:97)
at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction$4.handleException(TransportNodesOperationAction.java:178)
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
At the same moment, here is what `bobafett`'s logs look like:
[2015-09-11 10:47:37,112][INFO ][monitor.jvm ] [bobafett] [gc][old][2770][505] duration [7.3s], collections [1]/[7.8s], total [7.3s]/[26.1m], memory [996.9mb]->[941mb]/[1007.3mb], all_pools {[young] [133.1mb]->[83.4mb]/[133.1mb]}{[survivor] [6.2mb]->[0b]/[16.6mb]}{[old] [857.6mb]->[857.6mb]/[857.6mb]}
[2015-09-11 10:47:43,464][WARN ][cluster.service ] [bobafett] failed to reconnect to node [logstash-ip-10-89-6-32-21177-13486][1jbHK_T7RpWfObZuYAiotw][ip-10-89-6-32][inet[/10.89.6.32:9301]]{client=true, data=false}
org.elasticsearch.transport.ConnectTransportException: [logstash-ip-10-89-6-32-21177-13486][inet[/10.89.6.32:9301]] connect_timeout[30s]
at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:807)
at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:741)
at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:714)
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:150)
at org.elasticsearch.cluster.service.InternalClusterService$ReconnectToNodes.run(InternalClusterService.java:539)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.common.netty.channel.ConnectTimeoutException: connection timed out: /10.89.6.32:9301
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.processConnectTimeout(NioClientBoss.java:139)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:83)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
... 3 more
[2015-09-11 10:47:51,645][INFO ][monitor.jvm] [bobafett] [gc][old][2773][507] duration [7.3s], collections [1]/[7.7s], total [7.3s]/[26.3m], memory [1003.3mb]->[952.5mb]/[1007.3mb], all_pools {[young] [133.1mb]->[94.9mb]/[133.1mb]}{[survivor] [12.5mb]->[0b]/[16.6mb]}{[old] [857.6mb]->[857.6mb]/[857.6mb]}
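Reading the numbers in the two GC lines above: the old-generation pool sits at its full capacity (857.6mb of 857.6mb) even after a collection, and each old-generation pause lasts 7.3s out of a ~7.8s window, so the node spends almost all of its time in GC, which would presumably explain the 15s request timeouts. A quick sanity check on those figures:

```shell
# Fraction of wall-clock time spent in the old-GC pause from the log line:
# duration 7.3s within a 7.8s collection window
awk 'BEGIN { printf "%.0f%%\n", 7.3 / 7.8 * 100 }'   # 94%
# Old-gen occupancy after the collection: 857.6mb used of an 857.6mb pool
awk 'BEGIN { printf "%.0f%%\n", 857.6 / 857.6 * 100 }'   # 100%
```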
What can I do to debug and fix this behaviour?
Please tell me if I can provide any extra information.