Clustering problem with ES 2.x

I am trying to form an ES cluster in AWS private subnets.

When I add a second node, it joins the cluster, but only some shards get allocated.

I see these errors in the logs:
[2016-05-06 13:43:16,921][TRACE][discovery.zen.ping.unicast] [Llyron] [7] failed to connect to {#zen_unicast_8#}{::1}{[::1]:9302}
ConnectTransportException[[][[::1]:9302] connect_timeout[30s]]; nested: ConnectException[Connection refused: /0:0:0:0:0:0:0:1:9302];
at org.elasticsearch.transport.netty.NettyTransport.connectToChannelsLight(NettyTransport.java:916)
at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:880)
at org.elasticsearch.transport.netty.NettyTransport.connectToNodeLight(NettyTransport.java:852)
at org.elasticsearch.transport.TransportService.connectToNodeLight(TransportService.java:250)
at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:395)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: /0:0:0:0:0:0:0:1:9302
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.jboss.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:152)
at org.jboss.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105)
at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
... 3 more
[2016-05-06 13:43:16,921][TRACE][discovery.zen.ping.unicast] [Llyron] [7] failed to connect to {#zen_unicast_7#}{::1}{[::1]:9301}
ConnectTransportException[[][[::1]:9301] connect_timeout[30s]]; nested: ConnectException[Connection refused: /0:0:0:0:0:0:0:1:9301];
at org.elasticsearch.transport.netty.NettyTransport.connectToChannelsLight(NettyTransport.java:916)
at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:880)
at org.elasticsearch.transport.netty.NettyTransport.connectToNodeLight(NettyTransport.java:852)
at org.elasticsearch.transport.TransportService.connectToNodeLight(TransportService.java:250)
at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:395)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: /0:0:0:0:0:0:0:1:9301
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.jboss.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:152)
at org.jboss.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105)
at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
... 3 more

network.host: "10.0.136.205"
action.destructive_requires_name: false
script.inline: on
script.indexed: on

node.master: true
node.data: true
index.number_of_shards: "5"
index.number_of_replicas: 1
path.conf: /etc/elasticsearch
path.work: /tmp/elasticsearch
path.data: /data/elasticsearch
path.logs: /data/logs/elasticsearch

bootstrap.mlockall: true
discovery.zen.minimum_master_nodes: 1

# EC2 discovery allows to use AWS EC2 API in order to perform discovery.
cloud:
    aws:
        protocol: https
        region: ap-southeast-1
discovery.type: ec2
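
A quick way to check whether the second node has actually joined is something like the following (assuming the default HTTP port 9200; the IP is the network.host value from the config above):

# List the nodes the cluster currently sees, and the overall cluster health
curl -s 'http://10.0.136.205:9200/_cat/nodes?v'
curl -s 'http://10.0.136.205:9200/_cluster/health?pretty'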

Any ideas what might be going on here?

Can you telnet between the nodes on the transport port?

Hello,

Sorry for the late reply.

Yes, I can telnet between the nodes:

[root@ip-10-0-128-126 ~]# curl localhost:9200/_cat/nodes
10.0.128.32 10.0.128.32 8 84 0.12 d m Solomon O'Sullivan
10.0.128.126 10.0.128.126 7 69 0.00 d * Anelle
[root@ip-10-0-128-126 ~]# telnet 10.0.128.32 9300
Trying 10.0.128.32...
Connected to 10.0.128.32.
Escape character is '^]'.

Logs from the node that has the partially allocated shards:

[2016-05-18 08:35:50,345][DEBUG][org.apache.http.impl.conn.PoolingClientConnectionManager] Connection [id: 0][route: {s}->https://ec2.ap-southeast-1.amazonaws.com] can be kept alive for 60000 MILLISECONDS
[2016-05-18 08:35:50,345][DEBUG][org.apache.http.impl.conn.PoolingClientConnectionManager] Connection released: [id: 0][route: {s}->https://ec2.ap-southeast-1.amazonaws.com][total kept alive: 1; route allocated: 1 of 50; total allocated: 1 of 50]
[2016-05-18 08:35:50,346][DEBUG][discovery.ec2 ] [Solomon O'Sullivan] using dynamic discovery nodes [{#cloud-i-0ae3b3f8d82e96b2c-0}{10.0.128.32}{10.0.128.32:9300}, {#cloud-i-f15fc37f-0}{10.0.128.126}{10.0.128.126:9300}]
[2016-05-18 08:35:50,359][DEBUG][transport.netty ] [Solomon O'Sullivan] connected to node [{#zen_unicast_52_#cloud-i-f15fc37f-0#}{10.0.128.126}{10.0.128.126:9300}]
[2016-05-18 08:35:50,360][DEBUG][discovery.ec2 ] [Solomon O'Sullivan] filtered ping responses: (filter_client[true], filter_data[false])
--> ping_response{node [{Anelle}{yzdmJgzMS9WHZCd5yyN7zQ}{10.0.128.126}{10.0.128.126:9300}{master=true}], id[54], master [{Anelle}{yzdmJgzMS9WHZCd5yyN7zQ}{10.0.128.126}{10.0.128.126:9300}{master=true}], hasJoinedOnce [true], cluster_name[xyz]}
[2016-05-18 08:35:50,360][DEBUG][transport.netty ] [Solomon O'Sullivan] disconnecting from [{#zen_unicast_52_#cloud-i-f15fc37f-0#}{10.0.128.126}{10.0.128.126:9300}] due to explicit disconnect call
[2016-05-18 08:35:50,361][DEBUG][transport.netty ] [Solomon O'Sullivan] disconnecting from [{#zen_unicast_1#}{127.0.0.1}{127.0.0.1:9300}] due to explicit disconnect call
[2016-05-18 08:35:50,361][DEBUG][transport.netty ] [Solomon O'Sullivan] disconnecting from [{#zen_unicast_51_#cloud-i-f15fc37f-0#}{10.0.128.126}{10.0.128.126:9300}] due to explicit disconnect call
[2016-05-18 08:35:50,362][DEBUG][transport.netty ] [Solomon O'Sullivan] disconnecting from [{#zen_unicast_50_#cloud-i-f15fc37f-0#}{10.0.128.126}{10.0.128.126:9300}] due to explicit disconnect call
[2016-05-18 08:35:50,362][DEBUG][transport.netty ] [Solomon O'Sullivan] disconnecting from [{#zen_unicast_6#}{::1}{[::1]:9300}] due to explicit disconnect call
[2016-05-18 08:35:50,362][DEBUG][discovery.zen.publish ] [Solomon O'Sullivan] received diff for but don't have any local cluster state - requesting full state
[2016-05-18 08:35:50,367][DEBUG][cluster.service ] [Solomon O'Sullivan] processing [finalize_join ({Anelle}{yzdmJgzMS9WHZCd5yyN7zQ}{10.0.128.126}{10.0.128.126:9300}{master=true})]: execute
[2016-05-18 08:35:50,367][DEBUG][discovery.ec2 ] [Solomon O'Sullivan] no master node is set, despite of join request completing. retrying pings.
[2016-05-18 08:35:50,367][DEBUG][cluster.service ] [Solomon O'Sullivan] processing [finalize_join ({Anelle}{yzdmJgzMS9WHZCd5yyN7zQ}{10.0.128.126}{10.0.128.126:9300}{master=true})]: took 0s no change in cluster_state
[2016-05-18 08:35:50,371][DEBUG][transport.netty ] [Solomon O'Sullivan] connected to node [{#zen_unicast_6#}{::1}{[::1]:9300}]
[2016-05-18 08:35:50,371][DEBUG][transport.netty ] [Solomon O'Sullivan] connected to node [{#zen_unicast_1#}{127.0.0.1}{127.0.0.1:9300}]
[2016-05-18 08:35:50,373][DEBUG][transport.netty ] [Solomon O'Sullivan] connected to node [{#zen_unicast_53_#cloud-i-f15fc37f-0#}{10.0.128.126}{10.0.128.126:9300}]
[2016-05-18 08:35:51,874][DEBUG][transport.netty ] [Solomon O'Sullivan] connected to node [{#zen_unicast_54_#cloud-i-f15fc37f-0#}{10.0.128.126}{10.0.128.126:9300}]
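
Given the "no master node is set" message above, a cheap sanity check is to ask each node who it currently thinks the master is (a sketch using the cat master API on the default HTTP port; the IPs are the two nodes from the output above):

curl -s 'http://10.0.128.126:9200/_cat/master?v'
curl -s 'http://10.0.128.32:9200/_cat/master?v'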

Restarting the nodes also doesn't help.

I remember that resetting the number of replicas sometimes used to fix it, seemingly at random; now even that isn't working.
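
For context, "resetting the number of replicas" is just the usual index settings update, roughly this (applied to all indices here; it can be scoped to a single index instead):

# Drop replicas to 0 and then back to 1 to force re-allocation (sketch, all indices)
curl -XPUT 'localhost:9200/_settings' -d '{"index": {"number_of_replicas": 0}}'
curl -XPUT 'localhost:9200/_settings' -d '{"index": {"number_of_replicas": 1}}'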

[root@ip-10-0-128-126 ~]# curl localhost:9200/_cat/shards
xyz_config_v2.3.11_20160428125912 4 p STARTED 0 130b 10.0.128.126 Anelle
xyz_config_v2.3.11_20160428125912 4 r UNASSIGNED
xyz_config_v2.3.11_20160428125912 2 r INITIALIZING 10.0.128.32 Solomon O'Sullivan
xyz_config_v2.3.11_20160428125912 2 p STARTED 1 5.1kb 10.0.128.126 Anelle
xyz_config_v2.3.11_20160428125912 1 p STARTED 0 130b 10.0.128.126 Anelle
xyz_config_v2.3.11_20160428125912 1 r UNASSIGNED
xyz_config_v2.3.11_20160428125912 3 r INITIALIZING 10.0.128.32 Solomon O'Sullivan
xyz_config_v2.3.11_20160428125912 3 p STARTED 3 5.3kb 10.0.128.126 Anelle
xyz_config_v2.3.11_20160428125912 0 p STARTED 6285 2.5mb 10.0.128.126 Anelle
xyz_config_v2.3.11_20160428125912 0 r UNASSIGNED
kb4uoc712u0mkj9l 4 p STARTED 0 130b 10.0.128.126 Anelle
kb4uoc712u0mkj9l 4 r UNASSIGNED
kb4uoc712u0mkj9l 2 p STARTED 0 130b 10.0.128.126 Anelle
kb4uoc712u0mkj9l 2 r UNASSIGNED
kb4uoc712u0mkj9l 1 p STARTED 0 130b 10.0.128.126 Anelle
kb4uoc712u0mkj9l 1 r UNASSIGNED
kb4uoc712u0mkj9l 3 p STARTED 0 130b 10.0.128.126 Anelle
kb4uoc712u0mkj9l 3 r UNASSIGNED
kb4uoc712u0mkj9l 0 p STARTED 0 130b 10.0.128.126 Anelle
kb4uoc712u0mkj9l 0 r UNASSIGNED
mmv1ussb1asslmno 4 p STARTED 16202 9.7mb 10.0.128.126 Anelle
mmv1ussb1asslmno 4 r UNASSIGNED
mmv1ussb1asslmno 2 p STARTED 16315 9.7mb 10.0.128.126 Anelle
mmv1ussb1asslmno 2 r UNASSIGNED
mmv1ussb1asslmno 1 p STARTED 16156 9.7mb 10.0.128.126 Anelle
mmv1ussb1asslmno 1 r UNASSIGNED
mmv1ussb1asslmno 3 p STARTED 16250 9.7mb 10.0.128.126 Anelle
mmv1ussb1asslmno 3 r UNASSIGNED
mmv1ussb1asslmno 0 p STARTED 15702 9.5mb 10.0.128.126 Anelle
mmv1ussb1asslmno 0 r UNASSIGNED
sszkjul62_macpo3 4 p STARTED 225280 111.3mb 10.0.128.126 Anelle
sszkjul62_macpo3 4 r UNASSIGNED
sszkjul62_macpo3 2 p STARTED 225362 110.8mb 10.0.128.126 Anelle
sszkjul62_macpo3 2 r UNASSIGNED
sszkjul62_macpo3 1 p STARTED 224192 101.5mb 10.0.128.126 Anelle
sszkjul62_macpo3 1 r UNASSIGNED
sszkjul62_macpo3 3 p STARTED 224549 104.1mb 10.0.128.126 Anelle
sszkjul62_macpo3 3 r UNASSIGNED
sszkjul62_macpo3 0 p STARTED 224553 103.3mb 10.0.128.126 Anelle
sszkjul62_macpo3 0 r UNASSIGNED
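
In case it helps, the other things worth checking at this point are the overall cluster health and the recovery state of the initializing replicas, e.g. (default ports assumed):

curl -s 'localhost:9200/_cluster/health?pretty'
# Shows per-shard recovery progress, including the stuck replicas
curl -s 'localhost:9200/_cat/recovery?v'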

As it turns out, it's because of this: https://github.com/elastic/elasticsearch/issues/13445

The master node had the license plugin installed, whereas the new node didn't.
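
If anyone else hits this, an easy way to spot the mismatch is to compare the plugins installed on each node, e.g. via the cat plugins API (the license plugin should show up on every node):

curl -s 'localhost:9200/_cat/plugins?v'
# Installing the missing plugin on the new node and restarting it is roughly:
# bin/plugin install license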

Better logging would have been nice.