All shards unavailable after one of three master nodes left the cluster, ES 5.4

Hello everybody.

We have an ES 5.4 cluster on CentOS 7 with 3 master nodes and 11 data nodes (14 nodes, 44 indices, 956 shards). The nodes run on 7 machines, with two Elasticsearch instances per host.

Configuration discovery on all nodes:
discovery.zen.minimum_master_nodes: 2
discovery.zen.no_master_block: all
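
For reference, 2 is the quorum for our 3 master-eligible nodes (a minimal calculation, assuming only the three dedicated master nodes are master-eligible):

minimum_master_nodes = floor(3 / 2) + 1 = 2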

We needed to restart one master node to apply a change to its configuration file.
After restarting the active master node we saw that all shards on all nodes became unavailable and the cluster status went RED. All shards failed with the message "master marked shard as active, but shard has not been created, mark shard as failed".

We want to understand why this happens. As far as we understand, discovery.zen.minimum_master_nodes: 2 means that as long as 2 master-eligible nodes are present, the cluster remains available and so do all shards.
Why did all previously available shards become unavailable?

Some of the shards recovered after a while; the rest were restored with the command:
curl -XPOST 'hdp-01-master:9200/_cluster/reroute?retry_failed=true'
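
To watch the recovery afterwards, something like the following can be used to check cluster health and list the shards that are not yet started (a sketch; any node's HTTP endpoint works in place of hdp-01-master):

curl -XGET 'hdp-01-master:9200/_cluster/health?pretty'
curl -XGET 'hdp-01-master:9200/_cat/shards?v' | grep -v STARTED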

The contents of the log files:

Log file from the restarted active master (hdp-01-master):

[2017-06-13T10:45:40,661][INFO ][o.e.n.Node ] [hdp-01-master] stopping ...
[2017-06-13T10:45:40,690][INFO ][o.e.n.Node ] [hdp-01-master] stopped
[2017-06-13T10:45:40,690][INFO ][o.e.n.Node ] [hdp-01-master] closing ...
[2017-06-13T10:45:40,707][INFO ][o.e.n.Node ] [hdp-01-master] closed
[2017-06-13T10:47:37,016][INFO ][o.e.n.Node ] [hdp-01-master] initializing ...
[2017-06-13T10:47:37,113][INFO ][o.e.e.NodeEnvironment ] [hdp-01-master] using [1] data paths, mounts [[/elasticsearch (/dev/mapper/mpatha1)]], net usable_space [3.9tb], net total_space [4.3tb], spins? [possibly], types [ext4]
[2017-06-13T10:47:37,113][INFO ][o.e.e.NodeEnvironment ] [hdp-01-master] heap size [30.7gb], compressed ordinary object pointers [true]
[2017-06-13T10:47:37,189][INFO ][o.e.n.Node ] [hdp-01-master] node name [hdp-01-master], node ID [8YNfaQMsSoOdfuFmiic5NQ]
[2017-06-13T10:47:37,189][INFO ][o.e.n.Node ] [hdp-01-master] version[5.4.0], pid[12036], build[780f8c4/2017-04-28T17:43:27.229Z], OS[Linux/3.10.0-514.21.1.el7.x86_64/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/1.8.0_131/25.131-b12]
[2017-06-13T10:47:38,017][INFO ][o.e.p.PluginsService ] [hdp-01-master] loaded module [aggs-matrix-stats]
[2017-06-13T10:47:38,017][INFO ][o.e.p.PluginsService ] [hdp-01-master] loaded module [ingest-common]
[2017-06-13T10:47:38,017][INFO ][o.e.p.PluginsService ] [hdp-01-master] loaded module [lang-expression]
[2017-06-13T10:47:38,017][INFO ][o.e.p.PluginsService ] [hdp-01-master] loaded module [lang-groovy]
[2017-06-13T10:47:38,017][INFO ][o.e.p.PluginsService ] [hdp-01-master] loaded module [lang-mustache]
[2017-06-13T10:47:38,018][INFO ][o.e.p.PluginsService ] [hdp-01-master] loaded module [lang-painless]
[2017-06-13T10:47:38,018][INFO ][o.e.p.PluginsService ] [hdp-01-master] loaded module [percolator]
[2017-06-13T10:47:38,018][INFO ][o.e.p.PluginsService ] [hdp-01-master] loaded module [reindex]
[2017-06-13T10:47:38,018][INFO ][o.e.p.PluginsService ] [hdp-01-master] loaded module [transport-netty3]
[2017-06-13T10:47:38,018][INFO ][o.e.p.PluginsService ] [hdp-01-master] loaded module [transport-netty4]
[2017-06-13T10:47:38,019][INFO ][o.e.p.PluginsService ] [hdp-01-master] no plugins loaded
[2017-06-13T10:47:43,155][INFO ][o.e.d.DiscoveryModule ] [hdp-01-master] using discovery type [zen]
[2017-06-13T10:47:44,500][INFO ][o.e.n.Node ] [hdp-01-master] initialized
[2017-06-13T10:47:44,500][INFO ][o.e.n.Node ] [hdp-01-master] starting ...
[2017-06-13T10:47:44,678][INFO ][o.e.t.TransportService ] [hdp-01-master] publish_address {10.7.10.212:9300}, bound_addresses {10.7.10.212:9300}, {[::1]:9300}, {127.0.0.1:9300}, {[fe80::42f2:e9ff:fec3:6358]:9300}
[2017-06-13T10:47:44,685][INFO ][o.e.b.BootstrapChecks ] [hdp-01-master] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2017-06-13T10:47:58,971][INFO ][o.e.c.s.ClusterService ] [hdp-01-master] detected_master {hdp-02-master}{pxnoVS7JThaFx97fZ7CP3g}{flTcV3ijQsaf1tZ_t2-TpQ}{10.7.10.214}{10.7.10.214:9300}, added {{hdp-03-data-01}{reY2QtW7SLGqRrBWBzIzxQ}{5FdcgCdsSqeYPsoeN
scMTg}{10.7.10.216}{10.7.10.216:9301},{hdp-06-data-01}{7xt5lYsPQIC1q7xZu1dWBQ}{N8DvE1lCSKyh_7emtJblzA}{10.7.10.219}{10.7.10.219:9301},{hdp-04-data-01}{1twQnDaYQc6d3rNXROd1SA}{rtFhFHqjQhGToQKKIhTkVg}{10.7.10.217}{10.7.10.217:9301},{hdp-05-data-02}{_b
ppTnQ6SC6YjOzY0TvVBw}{i6Nl_GviRXK-aOem-Tim8g}{10.7.10.218}{10.7.10.218:9302},{hdp-07-data-01}{w91AB0WcQcuVLfslgL1Urw}{qLnVUwRQRSWJ3JCpvdLxrQ}{10.7.10.220}{10.7.10.220:9301},{hdp-06-data-02}{2BJg1e3eRgmf32l8XPGGIA}{P2AUqUY_S3eP7dRPAFN-Sg}{10.7.10.219}{10.4
.108.219:9302},{hdp-02-master}{pxnoVS7JThaFx97fZ7CP3g}{flTcV3ijQsaf1tZ_t2-TpQ}{10.7.10.214}{10.7.10.214:9300},{hdp-04-data-02}{mWFL6GKOS4yrPIctN-xbjw}{2kEW8a2_SFaW9Ysp6PRhiw}{10.7.10.217}{10.7.10.217:9302},{hdp-07-data-02}{ROH4DptMThCatsgr0QMtrQ}{wcVz
VTQjRGi6QDQy_ywp1A}{10.7.10.220}{10.7.10.220:9302},{hdp-02-data-01}{hjKZgaWwTwmjJQvWifkCtQ}{K4u8608QSUi8Rh0PzL4Y-A}{10.7.10.214}{10.7.10.214:9301},{hdp-01-data-01}{15gD8dOhSW267pERz_tDxw}{z3ZEAibTRd67dXGjFEx5Gw}{10.7.10.212}{10.7.10.212:9301},{hdp-0
3-master}{vvgvw0pUQau1Wo_GwPqCTw}{HxQzGLEQSDWXXSOQAq2A2Q}{10.7.10.216}{10.7.10.216:9300},{hdp-05-data-01}{WwypVa1qQpGsB_FyNuSUTg}{H-a9nDgoQCeYeYuNBdy4PQ}{10.7.10.218}{10.7.10.218:9301},}, reason: zen-disco-receive(from master [master {hdp-02-master}{pxnoVS
7JThaFx97fZ7CP3g}{flTcV3ijQsaf1tZ_t2-TpQ}{10.7.10.214}{10.7.10.214:9300} committed version [1120]])
[2017-06-13T10:47:59,137][INFO ][o.e.h.n.Netty4HttpServerTransport] [hdp-01-master] publish_address {10.7.10.212:9200}, bound_addresses {10.7.10.212:9200}, {[::1]:9200}, {127.0.0.1:9200}, {[fe80::42f2:e9ff:fec3:6358]:9200}
[2017-06-13T10:47:59,142][INFO ][o.e.n.Node ] [hdp-01-master] started
[2017-06-13T10:47:59,456][DEBUG][o.e.a.s.TransportSearchAction] [hdp-01-master] All shards failed for phase: [query]
[2017-06-13T10:47:59,456][DEBUG][o.e.a.s.TransportSearchAction] [hdp-01-master] All shards failed for phase: [query]
[2017-06-13T10:47:59,456][DEBUG][o.e.a.s.TransportSearchAction] [hdp-01-master] All shards failed for phase: [query]
[2017-06-13T10:47:59,458][WARN ][r.suppressed ] path: /vk_20170613/_search, params: {scroll=10m, index=vk_20170613}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed

Log file from the standby master (hdp-02-master):

[2017-06-13T10:45:40,661][INFO ][o.e.d.z.ZenDiscovery ] [hdp-02-master] master_left [{hdp-01-master}{8YNfaQMsSoOdfuFmiic5NQ}{T7jwtU_bScyJ_tXs2aHojw}{10.7.10.212}{10.7.10.212:9300}], reason [shut_down]
[2017-06-13T10:45:40,664][WARN ][o.e.d.z.ZenDiscovery ] [hdp-02-master] master left (reason = shut_down), current nodes: nodes:
{hdp-06-data-01}{7xt5lYsPQIC1q7xZu1dWBQ}{N8DvE1lCSKyh_7emtJblzA}{10.7.10.219}{10.7.10.219:9301}
{hdp-04-data-01}{1twQnDaYQc6d3rNXROd1SA}{rtFhFHqjQhGToQKKIhTkVg}{10.7.10.217}{10.7.10.217:9301}
{hdp-05-data-02}{_bppTnQ6SC6YjOzY0TvVBw}{i6Nl_GviRXK-aOem-Tim8g}{10.7.10.218}{10.7.10.218:9302}
{hdp-07-data-01}{w91AB0WcQcuVLfslgL1Urw}{qLnVUwRQRSWJ3JCpvdLxrQ}{10.7.10.220}{10.7.10.220:9301}
{hdp-04-data-02}{mWFL6GKOS4yrPIctN-xbjw}{2kEW8a2_SFaW9Ysp6PRhiw}{10.7.10.217}{10.7.10.217:9302}
{hdp-01-master}{8YNfaQMsSoOdfuFmiic5NQ}{T7jwtU_bScyJ_tXs2aHojw}{10.7.10.212}{10.7.10.212:9300}, master
{hdp-05-data-01}{WwypVa1qQpGsB_FyNuSUTg}{H-a9nDgoQCeYeYuNBdy4PQ}{10.7.10.218}{10.7.10.218:9301}
{hdp-02-data-01}{hjKZgaWwTwmjJQvWifkCtQ}{K4u8608QSUi8Rh0PzL4Y-A}{10.7.10.214}{10.7.10.214:9301}
{hdp-07-data-02}{ROH4DptMThCatsgr0QMtrQ}{wcVzVTQjRGi6QDQy_ywp1A}{10.7.10.220}{10.7.10.220:9302}
{hdp-01-data-01}{15gD8dOhSW267pERz_tDxw}{z3ZEAibTRd67dXGjFEx5Gw}{10.7.10.212}{10.7.10.212:9301}
{hdp-03-data-01}{reY2QtW7SLGqRrBWBzIzxQ}{5FdcgCdsSqeYPsoeNscMTg}{10.7.10.216}{10.7.10.216:9301}
{hdp-06-data-02}{2BJg1e3eRgmf32l8XPGGIA}{P2AUqUY_S3eP7dRPAFN-Sg}{10.7.10.219}{10.7.10.219:9302}
{hdp-03-master}{vvgvw0pUQau1Wo_GwPqCTw}{HxQzGLEQSDWXXSOQAq2A2Q}{10.7.10.216}{10.7.10.216:9300}
{hdp-02-master}{pxnoVS7JThaFx97fZ7CP3g}{flTcV3ijQsaf1tZ_t2-TpQ}{10.7.10.214}{10.7.10.214:9300}, local
[2017-06-13T10:45:43,673][WARN ][o.e.d.z.ZenDiscovery ] [hdp-02-master] failed to connect to master [{hdp-01-master}{8YNfaQMsSoOdfuFmiic5NQ}{T7jwtU_bScyJ_tXs2aHojw}{10.7.10.212}{10.7.10.212:9300}], retrying...
org.elasticsearch.transport.ConnectTransportException: [hdp-01-master][10.7.10.212:9300] connect_timeout[30s]
at org.elasticsearch.transport.netty4.Netty4Transport.connectToChannels(Netty4Transport.java:359) ~[?:?]
at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:526) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:465) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:315) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:302) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.discovery.zen.ZenDiscovery.joinElectedMaster(ZenDiscovery.java:468) [elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.discovery.zen.ZenDiscovery.innerJoinCluster(ZenDiscovery.java:420) [elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.discovery.zen.ZenDiscovery.access$4100(ZenDiscovery.java:83) [elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.discovery.zen.ZenDiscovery$JoinThreadControl$1.run(ZenDiscovery.java:1197) [elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.4.0.jar:5.4.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 10.7.10.212/10.7.10.212:9300
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:352) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
[2017-06-13T10:45:46,755][INFO ][o.e.c.s.ClusterService ] [hdp-02-master] new_master {hdp-02-master}{pxnoVS7JThaFx97fZ7CP3g}{flTcV3ijQsaf1tZ_t2-TpQ}{10.7.10.214}{10.7.10.214:9300}, reason: zen-disco-elected-as-master ([1] nodes joined)[{hdp-03-master}
{vvgvw0pUQau1Wo_GwPqCTw}{HxQzGLEQSDWXXSOQAq2A2Q}{10.7.10.216}{10.7.10.216:9300}]
[2017-06-13T10:45:46,769][WARN ][o.e.c.a.s.ShardStateAction] [hdp-02-master] [vk_20170607_merge][1] received shard failed for shard id [[vk_20170607_merge][1]], allocation id [zk-Ez1c7QZmqhbJFeV8BUw], primary term [0], message [master marked shard as active, but shard has not been created, mark shard as failed]
[2017-06-13T10:45:46,769][WARN ][o.e.c.a.s.ShardStateAction] [hdp-02-master] [vk_20170607_merge][0] received shard failed for shard id [[vk_20170607_merge][0]], allocation id [XaZTO1rEQlCLdwCrXbJzxw], primary term [0], message [master marked shard as active, but shard has not been created, mark shard as failed]
[2017-06-13T10:45:46,769][WARN ][o.e.c.a.s.ShardStateAction] [hdp-02-master] [buzz_20170607_merge][6] received shard failed for shard id [[buzz_20170607_merge][6]], allocation id [rXf2lzH7QK2y8_eS9UC2hA], primary term [0], message [master marked shard as active, but shard has not been created, mark shard as failed]
[2017-06-13T10:45:46,769][WARN ][o.e.c.a.s.ShardStateAction] [hdp-02-master] [vk_20170604_merge][3] received shard failed for shard id [[vk_20170604_merge][3]], allocation id [icTS5DpwSem22HNApST6cg], primary term [0], message [master marked shard as active, but shard has not been created, mark shard as failed]
[2017-06-13T10:45:46,770][WARN ][o.e.c.a.s.ShardStateAction] [hdp-02-master] [vk_20170611][7] received shard failed for shard id [[vk_20170611][7]], allocation id [L_i6lzDoRnSdx3ZLIzo2-w], primary term [0], message [master marked shard as active, but shard has not been created, mark shard as failed]

Further down in the log file there are "received shard failed for shard id" entries for every shard in the cluster.

no_master_block being all means that once the master disappears the cluster cannot do anything, including allocating shards.
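
If the goal is to keep the cluster readable while there is no elected master, the block can be relaxed in elasticsearch.yml (a sketch; write is the default level and rejects writes while still allowing reads):

discovery.zen.no_master_block: write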

Thanks for your reply!

The documentation says about discovery.zen.no_master_block: all:
All operations on the node—i.e. both read & writes—will be rejected. This also applies for api cluster state read or write operations, like the get index settings, put mapping and cluster state api.

Does that mean that shard allocation counts as one of those cluster state API operations?
Do I understand correctly that if we switch this setting to "write", the cluster will not go RED when the active master is shut down (assuming the other two master nodes are still running)?
