Cluster not formed

Usually everything works well; the hosts' IPs and ports are configured properly and were not changed.
Why didn't the cluster form this time?


Elastic-1


[2017-12-19T13:03:18,058][INFO ][o.e.d.DiscoveryModule    ] [node-1] using discovery type [zen]
[2017-12-19T13:03:18,417][INFO ][o.e.n.Node               ] [node-1] initialized
[2017-12-19T13:03:18,417][INFO ][o.e.n.Node               ] [node-1] starting ...
[2017-12-19T13:03:18,539][INFO ][o.e.t.TransportService   ] [node-1] publish_address {172.16.65.114:9300}, bound_addresses {172.16.65.114:9300}
[2017-12-19T13:03:18,543][INFO ][o.e.b.BootstrapChecks    ] [node-1] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2017-12-19T13:03:48,556][WARN ][o.e.n.Node               ] [node-1] timed out while waiting for initial discovery state - timeout: 30s
[2017-12-19T13:03:48,563][INFO ][o.e.h.n.Netty4HttpServerTransport] [node-1] publish_address {172.16.65.114:9200}, bound_addresses {172.16.65.114:9200}
[2017-12-19T13:03:48,565][INFO ][o.e.n.Node               ] [node-1] started



Elastic-2

[2017-12-19T13:03:20,525][INFO ][o.e.t.TransportService   ] [node-2] publish_address {172.16.65.117:9300}, bound_addresses {172.16.65.117:9300}
[2017-12-19T13:03:20,530][INFO ][o.e.b.BootstrapChecks    ] [node-2] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2017-12-19T13:03:25,462][INFO ][o.e.c.s.ClusterService   ] [node-2] detected_master {node-3}{7KuCh13YQ4Knusm_wE0oCg}{Dq4AP-L2TyWPOdrxLaLtHg}{172.16.67.71}{172.16.67.71:9300}, added {{node-3}{7KuCh13YQ4Knusm_wE0oCg}{Dq4AP-L2TyWPOdrxLaLtHg}{172.16.67.71}{172.16.67.71:9300},}, reason: zen-disco-receive(from master [master {node-3}{7KuCh13YQ4Knusm_wE0oCg}{Dq4AP-L2TyWPOdrxLaLtHg}{172.16.67.71}{172.16.67.71:9300} committed version [1]])
[2017-12-19T13:03:25,469][INFO ][o.e.h.n.Netty4HttpServerTransport] [node-2] publish_address {172.16.65.117:9200}, bound_addresses {172.16.65.117:9200}
[2017-12-19T13:03:25,472][INFO ][o.e.n.Node               ] [node-2] started

Elastic-3

[2017-12-19T13:03:22,310][INFO ][o.e.n.Node               ] [node-3] starting ...
[2017-12-19T13:03:22,419][INFO ][o.e.t.TransportService   ] [node-3] publish_address {172.16.67.71:9300}, bound_addresses {172.16.67.71:9300}
[2017-12-19T13:03:22,424][INFO ][o.e.b.BootstrapChecks    ] [node-3] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2017-12-19T13:03:25,454][INFO ][o.e.c.s.ClusterService   ] [node-3] new_master {node-3}{7KuCh13YQ4Knusm_wE0oCg}{Dq4AP-L2TyWPOdrxLaLtHg}{172.16.67.71}{172.16.67.71:9300}, added {{node-2}{aitC5uqmRtKedcr3YT4AQw}{yM7opvE7Siu7HdmpXTciNw}{172.16.65.117}{172.16.65.117:9300},}, reason: zen-disco-elected-as-master ([1] nodes joined)[{node-2}{aitC5uqmRtKedcr3YT4AQw}{yM7opvE7Siu7HdmpXTciNw}{172.16.65.117}{172.16.65.117:9300}]
[2017-12-19T13:03:25,474][INFO ][o.e.h.n.Netty4HttpServerTransport] [node-3] publish_address {172.16.67.71:9200}, bound_addresses {172.16.67.71:9200}
[2017-12-19T13:03:25,476][INFO ][o.e.n.Node               ] [node-3] started
[2017-12-19T13:03:25,729][INFO ][o.e.g.GatewayService     ] [node-3] recovered [1] indices into cluster_state
[2017-12-19T13:03:26,040][INFO ][o.e.c.r.a.AllocationService] [node-3] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[events_1513680805055][0]] ...]).
[2017-12-19T13:03:38,787][WARN ][o.e.c.a.s.ShardStateAction] [node-3] [events_1513680805055][0] received shard failed for shard id [[events_1513680805055][0]], allocation id [CwQRlRIOQzaNbB4bnc4Ydg], primary term [16], message [mark copy as stale]
[2017-12-19T13:03:53,879][WARN ][o.e.d.z.ZenDiscovery     ] [node-3] not enough master nodes (has [1], but needed [2]), current nodes: nodes: 
   {node-2}{aitC5uqmRtKedcr3YT4AQw}{yM7opvE7Siu7HdmpXTciNw}{172.16.65.117}{172.16.65.117:9300}
   {node-3}{7KuCh13YQ4Knusm_wE0oCg}{Dq4AP-L2TyWPOdrxLaLtHg}{172.16.67.71}{172.16.67.71:9300}, local, master

My transport client tried to connect to the cluster and perform write requests, and got:


org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];

This is Elasticsearch 5.4.

And why did one Elasticsearch node start even though minimum_master_nodes is configured as 2?

elasticsearch.yml:

discovery.zen.minimum_master_nodes: 2
discovery.zen.commit_timeout: 2s
discovery.zen.publish_timeout: 2s
discovery.zen.fd.ping_timeout: 1s
transport.tcp.connect_timeout: 1s

Apparently Node 1 cannot be seen by the others.

You started Node2. It waited until Node3 joined. They formed the cluster.

But Node1 is still not there.
I'd suggest restarting Node1.
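Also worth double-checking that every node's elasticsearch.yml lists all three hosts for discovery; with three master-eligible nodes, minimum_master_nodes: 2 is the right quorum (3/2 + 1, rounded down). A minimal sketch of that setting, with the IPs taken from your logs above (so treat the values as assumptions):

discovery.zen.ping.unicast.hosts: ["172.16.65.114", "172.16.65.117", "172.16.67.71"]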

Why is my client not working? I have 2 nodes, as you said, but I guess it connects to only 1...
I want it to fail...
I am using


client.addTransportAddress(new InetSocketTransportAddress(address, EventsConstants.ES_PORT));

3 times, once for each of the three addresses.
If it cannot connect to all 3, I want it to fail.
In addition, why was node-1 not visible? Are you suggesting a network issue?
And why didn't node-1 shut down when it didn't have enough master nodes?

Are you suggesting a network issue?

Yes, it might be.

And why didn't node-1 shut down when it didn't have enough master nodes?

It is waiting for enough master nodes to come online. Once it can find them, it will join the cluster.

So why isn't a client configured to connect to 3 nodes throwing any exception in this kind of scenario?
What API can I use to find out which nodes are in which cluster?
Is there any timeout for the node to understand that it can't see its "friends" in the cluster? And to understand that it is alone and should go down?

So why isn't a client configured to connect to 3 nodes throwing any exception in this kind of scenario?

That's a valid point. The node is available but can't really deal with requests, as it is not part of the cluster, which is not correct behavior. I mean that when you have a node like this, you should fix the problem ASAP before using any client.
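In the meantime, you can fail fast on the client side by checking how many nodes the transport client actually connected to before sending writes. A minimal sketch for a 5.x transport client, assuming a cluster name of "my-cluster" and reusing your three addresses (both are assumptions on my side; run it inside a method that can throw UnknownHostException):

import java.net.InetAddress;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.transport.client.PreBuiltTransportClient;

// Register all three nodes, then fail fast if the initial handshake
// did not reach every one of them.
TransportClient client = new PreBuiltTransportClient(Settings.builder()
        .put("cluster.name", "my-cluster")   // assumption: use your cluster name
        .build());
for (String host : new String[]{"172.16.65.114", "172.16.65.117", "172.16.67.71"}) {
    client.addTransportAddress(
            new InetSocketTransportAddress(InetAddress.getByName(host), 9300));
}
if (client.connectedNodes().size() < 3) {    // connectedNodes() lists reachable nodes
    client.close();
    throw new IllegalStateException(
            "expected 3 connected nodes, got " + client.connectedNodes().size());
}

Note that this only checks reachability: a node that is up but never joined the cluster will still count as "connected" and then reject writes with the ClusterBlockException you saw.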

what API can I use in order to know which nodes are in which cluster?

Probably something like the Nodes Info API?
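With the transport client it could look like this (a sketch, reusing the client from your snippet):

import org.elasticsearch.action.admin.cluster.node.info.NodeInfo;
import org.elasticsearch.action.admin.cluster.node.info.NodesInfoResponse;

// Ask the node(s) the client reached which nodes they currently see.
NodesInfoResponse info = client.admin().cluster().prepareNodesInfo().get();
System.out.println("cluster: " + info.getClusterName().value());
for (NodeInfo node : info.getNodes()) {
    System.out.println(node.getNode().getName() + " -> " + node.getNode().getAddress());
}

Pointing a separate client at each node and comparing the answers should show you who ended up in which cluster.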

Is there any timeout for the node to understand that it can't see its "friends" in the cluster?

As far as I remember, it's 30s by default before the WARN is printed; that's the "timed out while waiting for initial discovery state - timeout: 30s" line in your node-1 log.

And to understand that it is alone and should go down?

That will never happen on its own. If you want to shut it down, that must be something you control, because if it's a network issue, the network can come back and the node can join the cluster again.
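If you do want that behavior, it has to live outside Elasticsearch, in something you run yourself. A minimal watchdog sketch (the expected node count of 3 and the reaction are assumptions on my side):

import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.common.unit.TimeValue;

// Poll cluster health; if fewer nodes than expected are visible,
// react with whatever alerting/shutdown procedure you control.
ClusterHealthResponse health = client.admin().cluster().prepareHealth()
        .setTimeout(TimeValue.timeValueSeconds(5))
        .get();
if (health.getNumberOfNodes() < 3) {
    // e.g. alert an operator, or stop the node via your service manager
}

On a node that never joined the cluster, the health call itself may fail with the same ClusterBlockException, which is just as good a signal to act on.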

Thank you, I think I will try using the Nodes Info API and handle it from there.
