ClusterBlockException after patching ES to 2.3.3 from 2.3.2

Patched an RPM-based ES 2.3.2 cluster to 2.3.3 last Friday; since then we've seen multiple events like this:

[2016-05-20 09:23:13,302][INFO ][node ] [d1r1n1] started
[2016-05-20 09:23:26,212][DEBUG][action.admin.indices.create] [d1r1n1] no known master node, scheduling a retry
[2016-05-20 09:23:56,783][DEBUG][action.admin.indices.create] [d1r1n1] no known master node, scheduling a retry
[2016-05-20 09:24:11,558][DEBUG][action.admin.indices.create] [d1r1n1] no known master node, scheduling a retry
[2016-05-20 09:24:26,213][DEBUG][action.admin.indices.create] [d1r1n1] timed out while retrying [indices:admin/create] after failure (timeout [1m])
[2016-05-20 09:24:26,216][WARN ][rest.suppressed ] /_bulk Params: {}
ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];]
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:154)
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:144)
at org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(TransportBulkAction.java:212)
at org.elasticsearch.action.bulk.TransportBulkAction.access$000(TransportBulkAction.java:71)
at org.elasticsearch.action.bulk.TransportBulkAction$1.onFailure(TransportBulkAction.java:150)
at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:95)
at org.elasticsearch.action.support.ThreadedActionListener$2.doRun(ThreadedActionListener.java:104)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

Tried stopping all nodes and restarting them again, but the cluster only seems to half work (data does go in and out). Hints on how to investigate this further would be appreciated, TIA.
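
This is roughly what I've been checking so far — cluster health, which node (if any) is seen as elected master, and whether anything is stuck in the pending tasks queue (standard _cluster/_cat endpoints, nothing cluster-specific assumed):

# overall health and recovered/blocked state
curl -XGET "http://`hostname`:9200/_cluster/health?pretty"
# which node is currently elected master, if any
curl -XGET "http://`hostname`:9200/_cat/master?v"
# anything queued up on the master
curl -XGET "http://`hostname`:9200/_cluster/pending_tasks?pretty"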

After stopping all nodes and restarting them again we have:

> [root@d1r1n1 ~]# curl -XGET "http://`hostname`:9200/_cat/nodes"
> <redacted>.170 <redacted>.170 14 99 0.18 d * d1r1n6
> <redacted>.176 <redacted>.176  4 99 0.14 d m d1r1n12
> <redacted>.168 <redacted>.168 10 98 0.67 d m d1r1n4
> <redacted>.165 <redacted>.165  8 99 0.19 d m d1r1n1
> <redacted>.178 <redacted>.178 13 99 0.38 d m d1r1n14
> <redacted>.169 <redacted>.169 18 93 0.41 d m d1r1n5
> <redacted>.175 <redacted>.175  8 99 0.04 d m d1r1n11
> <redacted>.183 <redacted>.183  4 47 0.12 c - kibana/perf
> <redacted>.177 <redacted>.177  7 99 0.45 d m d1r1n13
> <redacted>.171 <redacted>.171 13 99 0.19 d m d1r1n7
> <redacted>.172 <redacted>.172 10 99 0.33 d m d1r1n8
> <redacted>.166 <redacted>.166  8 99 0.24 d m d1r1n2
> <redacted>.174 <redacted>.174 10 99 0.29 d m d1r1n10
> <redacted>.167 <redacted>.167  8 97 0.47 d m d1r1n3
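
(Side note: _cat/nodes without ?v omits the header row; adding ?v labels the columns, which makes the master column — * for the elected master, m for master-eligible — easier to read:)

curl -XGET "http://`hostname`:9200/_cat/nodes?v"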

Admittedly I also patched another cluster from 2.3.2 to 2.3.3 without issues, so it's most probably not related to the patch-level jump, but more likely to the fact that this broken cluster went through a major network renumbering a week ago (the network was taken down while the nodes were up and running), though the nodes seemed okay again after the renumbering was done.

Right now it seems only one node, d1r1n9, fails to properly join the cluster; it's the only one that logs this after a restart:

> [2016-05-23 15:13:46,721][INFO ][node                     ] [d1r1n9] version[2.3.3], pid[5438], build[218bdf1/2016-05-17T15:40:04Z]
> [2016-05-23 15:13:46,721][INFO ][node                     ] [d1r1n9] initializing ...
> [2016-05-23 15:13:47,131][INFO ][plugins                  ] [d1r1n9] modules [reindex, lang-expression, lang-groovy], plugins [head, kopf, hq], sites [kopf, hq, head]
> [2016-05-23 15:13:47,145][INFO ][env                      ] [d1r1n9] using [1] data paths, mounts [[/mxes_data/1 (/dev/mapper/vg--blob1-lv--mxes)]], net usable_space [186.2gb], net total_space [199.9gb], spins? [possibly], types [xfs]
> [2016-05-23 15:13:47,145][INFO ][env                      ] [d1r1n9] heap size [7.8gb], compressed ordinary object pointers [true]
> [2016-05-23 15:13:47,145][WARN ][env                      ] [d1r1n9] max file descriptors [65535] for elasticsearch process likely too low, consider increasing to at least [65536]
> [2016-05-23 15:13:48,807][INFO ][node                     ] [d1r1n9] initialized
> [2016-05-23 15:13:48,807][INFO ][node                     ] [d1r1n9] starting ...
> [2016-05-23 15:13:48,898][INFO ][transport                ] [d1r1n9] publish_address {<redacted>.173:9300}, bound_addresses {<redacted>.173:9300}
> [2016-05-23 15:13:48,901][INFO ][discovery                ] [d1r1n9] mxes_data/zkaI-nYmQNS9JS1cziE08w
> [2016-05-23 15:14:18,903][WARN ][discovery                ] [d1r1n9] waited for 30s and no initial state was set by the discovery
> [2016-05-23 15:14:18,913][INFO ][http                     ] [d1r1n9] publish_address {<redacted>.173:9200}, bound_addresses {<redacted>.173:9200}
> [2016-05-23 15:14:18,913][INFO ][node                     ] [d1r1n9] started
> [2016-05-23 15:14:39,986][DEBUG][action.admin.indices.create] [d1r1n9] no known master node, scheduling a retry
> [2016-05-23 15:14:42,316][DEBUG][action.admin.indices.create] [d1r1n9] no known master node, scheduling a retry
> [2016-05-23 15:14:49,546][DEBUG][action.admin.indices.create] [d1r1n9] no known master node, scheduling a retry
> [2016-05-23 15:15:36,298][DEBUG][action.admin.indices.create] [d1r1n9] no known master node, scheduling a retry
> [2016-05-23 15:15:39,069][DEBUG][action.admin.indices.create] [d1r1n9] no known master node, scheduling a retry
> [2016-05-23 15:15:39,988][DEBUG][action.admin.indices.create] [d1r1n9] timed out while retrying [indices:admin/create] after failure (timeout [1m])
> [2016-05-23 15:15:39,990][WARN ][rest.suppressed          ] /_bulk Params: {}
> ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];]
>         at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:154)
>         at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:144)
>         at org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(TransportBulkAction.java:212)
>         at org.elasticsearch.action.bulk.TransportBulkAction.access$000(TransportBulkAction.java:71)
>         at org.elasticsearch.action.bulk.TransportBulkAction$1.onFailure(TransportBulkAction.java:150)
>         at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:95)
>         at org.elasticsearch.action.support.ThreadedActionListener$2.doRun(ThreadedActionListener.java:104)
>         at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>         at java.lang.Thread.run(Unknown Source)
> [2016-05-23 15:15:42,316][DEBUG][action.admin.indices.create] [d1r1n9] timed out while retrying [indices:admin/create] after failure (timeout [1m])
> [2016-05-23 15:15:42,317][WARN ][rest.suppressed          ] /_bulk Params: {}

The same last few events keep repeating about every minute. I can connect from this node to the other nodes on both ports 9300 and 9200, so why won't it find its master...
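
Since raw connectivity is fine, my next guess was name resolution of whatever is listed in discovery.zen.ping.unicast.hosts on that node — checks along these lines (the hostnames below are placeholders, not our real ones):

# does the node's unicast host list still resolve?
grep -A2 unicast /etc/elasticsearch/elasticsearch.yml
getent hosts d1r1n1                 # short hostname (placeholder)
getent hosts d1r1n1.example.com     # FQDN (placeholder)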

Stupid me, my mistake :blush:

Turns out that after the renumbering of one network, the nodes could no longer resolve the short hostnames used on that network to reach the other nodes. Correcting elasticsearch.yml to use resolvable hostnames (FQDNs) in discovery.zen.ping.unicast.hosts brought things back to normal :relieved:
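
For anyone else hitting this, the relevant part of /etc/elasticsearch/elasticsearch.yml ended up looking roughly like this (the FQDNs below are placeholders, and the list is trimmed):

discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["d1r1n1.example.com", "d1r1n2.example.com", "d1r1n3.example.com"]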

Weirdly enough the nodes somehow still connected sometimes... even though discovery.zen.ping.multicast.enabled is false.
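
(If anyone wants to double-check which discovery settings their nodes actually picked up, the nodes info API should show the effective settings per node:)

curl -XGET "http://`hostname`:9200/_nodes/settings?pretty"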