Connection timeout between nodes

I have the following Elasticsearch setup:

  • an Elasticsearch master node on coruscant (which also runs Logstash). In the
    past this node was a data node, but not anymore, since we have ...
  • two Elasticsearch data nodes, on jangofett and bobafett
  • those three hosts (coruscant, jangofett, bobafett) can reach each other
    on port 9300 (see the quick check below)
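
To double-check that last point from the Elasticsearch side, this is roughly what I run on coruscant (assuming the HTTP API is on the default port 9200):

    # Every node the master currently knows about, including the Logstash client
    # node, with a heap-usage column at a glance.
    curl -s 'http://localhost:9200/_cat/nodes?v'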

I often experience timeouts when querying the cluster, as well as timeouts between the nodes themselves, and the cluster status is red.
Here are the relevant pieces of information.

Master node timing out when querying the data nodes:

    [2015-09-11 10:45:27,738][DEBUG][action.admin.cluster.node.stats] [coruscant] failed to execute on node [f29XN7asR6WL0aHIwWdjtw]
    org.elasticsearch.transport.ReceiveTimeoutTransportException: [bobafett][inet[/10.0.32.125:9300]][cluster:monitor/nodes/stats[n]] request_id [5357769] timed out after [15000ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

    [2015-09-11 10:45:39,586][WARN ][gateway.local] [coruscant] [preprod-2015.06.30][4]: failed to list shard stores on node [f29XN7asR6WL0aHIwWdjtw]
    org.elasticsearch.action.FailedNodeException: Failed node [f29XN7asR6WL0aHIwWdjtw]
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.onFailure(TransportNodesOperationAction.java:206)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access$1000(TransportNodesOperationAction.java:97)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction$4.handleException(TransportNodesOperationAction.java:178)
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
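
When coruscant logs one of these timeouts, I try to capture what bobafett is doing at that exact moment. A rough diagnostic sketch, assuming the HTTP API (port 9200) is reachable from coruscant:

    # What are bobafett's busiest threads doing right now? Any node can answer;
    # the request is forwarded to the node named in the URL.
    curl -s 'http://localhost:9200/_nodes/bobafett/hot_threads'

    # bobafett's JVM heap and GC counters, to correlate with the timeouts above.
    curl -s 'http://localhost:9200/_nodes/bobafett/stats/jvm?pretty'

If bobafett is stuck in an old-generation GC pause, even these requests can hang for several seconds, which is a data point in itself.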

At the same moment, here is what bobafett's logs look like:

     [2015-09-11 10:47:37,112][INFO ][monitor.jvm ] [bobafett] [gc][old][2770][505] duration [7.3s], collections [1]/[7.8s], total [7.3s]/[26.1m], memory [996.9mb]->[941mb]/[1007.3mb], all_pools {[young] [133.1mb]->[83.4mb]/[133.1mb]}{[survivor] [6.2mb]->[0b]/[16.6mb]}{[old] [857.6mb]->[857.6mb]/[857.6mb]}
     [2015-09-11 10:47:43,464][WARN ][cluster.service ] [bobafett] failed to reconnect to node [logstash-ip-10-89-6-32-21177-13486][1jbHK_T7RpWfObZuYAiotw][ip-10-89-6-32][inet[/10.89.6.32:9301]]{client=true, data=false}
     org.elasticsearch.transport.ConnectTransportException: [logstash-ip-10-89-6-32-21177-13486][inet[/10.89.6.32:9301]] connect_timeout[30s]
         at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:807)
         at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:741)
         at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:714)
         at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:150)
         at org.elasticsearch.cluster.service.InternalClusterService$ReconnectToNodes.run(InternalClusterService.java:539)
         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
         at java.lang.Thread.run(Thread.java:745)
     Caused by: org.elasticsearch.common.netty.channel.ConnectTimeoutException: connection timed out: /10.89.6.32:9301
         at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.processConnectTimeout(NioClientBoss.java:139)
         at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:83)
         at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
         at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
         at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
         at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
         ... 3 more
     [2015-09-11 10:47:51,645][INFO ][monitor.jvm] [bobafett] [gc][old][2773][507] duration [7.3s], collections [1]/[7.7s], total [7.3s]/[26.3m], memory [1003.3mb]->[952.5mb]/[1007.3mb], all_pools {[young] [133.1mb]->[94.9mb]/[133.1mb]}{[survivor] [12.5mb]->[0b]/[16.6mb]}{[old] [857.6mb]->[857.6mb]/[857.6mb]}
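
The part that worries me most here is the heap: the old generation is pinned at [857.6mb]->[857.6mb]/[857.6mb] and the whole heap is only about 1GB ([1007.3mb]), which looks like the default ES_HEAP_SIZE. A sketch of what I plan to check on jangofett and bobafett (the file path assumes the standard .deb package; it is /etc/sysconfig/elasticsearch on RPM installs):

    # How big a heap was Elasticsearch actually started with on this data node?
    grep ES_HEAP_SIZE /etc/default/elasticsearch

    # If it is unset or 1g, raise it (e.g. ES_HEAP_SIZE=4g, depending on the
    # instance's RAM) and restart the node.
    sudo service elasticsearch restart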

What can I do to debug and fix this behaviour?

Please tell me if I can provide any extra information.

So I hit the 5K-character limit way too early and some info was missing:

Global status:

    GET /_cluster/health
    {
      "unassigned_shards": 1535,
      "initializing_shards": 6,
      "relocating_shards": 0,
      "active_shards": 1501,
      "active_primary_shards": 1499,
      "number_of_data_nodes": 2,
      "number_of_nodes": 4,
      "timed_out": false,
      "status": "red",
      "cluster_name": "elasticsearch"
    }
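
If I add these up, that is 1501 + 6 + 1535 = 3042 shards for only two data nodes, i.e. roughly 1500 shards per data node on a ~1GB heap, which would explain the GC pressure above (the 4 nodes are the master, the two data nodes and the Logstash client node). Here is the kind of inventory and cleanup I have in mind; the preprod-2015.06.* pattern is only an example based on the preprod-2015.06.30 index visible in the logs:

    # Shard count and health per index.
    curl -s 'http://localhost:9200/_cat/indices?v'

    # How shards and disk usage are spread across the two data nodes.
    curl -s 'http://localhost:9200/_cat/allocation?v'

    # Option 1: drop the replicas of old daily indices to halve their shard count
    # (double-check the pattern against _cat/indices first).
    curl -XPUT 'http://localhost:9200/preprod-2015.06.*/_settings' \
         -d '{"index": {"number_of_replicas": 0}}'

    # Option 2: close indices that are no longer queried so they stop using heap.
    curl -XPOST 'http://localhost:9200/preprod-2015.06.*/_close'

With daily indices and the default 5 shards + 1 replica per index, the total grows by 10 shards a day, so curating old indices (for example with Elasticsearch Curator) looks like the longer-term fix.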

To answer the question asked in the comments ("Are these all in the same datacenter?"):

All nodes are hosted on AWS EC2. They are all in the same region (eu-west-1) but not in the same availability zone, hence not in the same datacenter (they are spread across eu-west-1a, eu-west-1b and eu-west-1c).
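
Since the nodes are spread across availability zones, I also want to rule out a plain networking issue: checking which transport addresses each node actually binds and publishes, and making sure the EC2 security groups allow traffic on 9300 (and 9301 for the Logstash client node) between the zones. Assuming HTTP on the default port 9200:

    # Bound and published transport (9300) addresses for every node in the cluster.
    curl -s 'http://localhost:9200/_nodes/transport?pretty'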