All shards unavailable after one of three master nodes left the cluster, ES 5.4

Hello everybody.

We have an ES 5.4 cluster on CentOS 7 with 3 master nodes and 11 data nodes (14 nodes, 44 indices, 956 shards). The nodes run on 7 machines, with two Elasticsearch instances per host.

Configuration discovery on all nodes:
discovery.zen.minimum_master_nodes: 2
discovery.zen.no_master_block: all
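
For reference, 2 is the quorum for our 3 master-eligible nodes (a minimal calculation, assuming only the three dedicated master nodes are master-eligible):

minimum_master_nodes = floor(3 / 2) + 1 = 2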

We needed to restart one master node to apply a change to its configuration file.
After restarting the active master node we saw that all shards on all nodes became unavailable and the cluster status went RED. All shards failed with the message "master marked shard as active, but shard has not been created, mark shard as failed".

We want to understand why this happens. As far as we understand, discovery.zen.minimum_master_nodes: 2 means that as long as 2 master-eligible nodes are present, the cluster remains available and so do all shards.
Why did all previously available shards become unavailable?

Some of the shards recovered after a while; the rest were restored with the command:
curl -XPOST 'hdp-01-master:9200/_cluster/reroute?retry_failed=true'
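
To watch the recovery afterwards, something like the following can be used to check cluster health and list the shards that are not yet started (a sketch; any node's HTTP endpoint works in place of hdp-01-master):

curl -XGET 'hdp-01-master:9200/_cluster/health?pretty'
curl -XGET 'hdp-01-master:9200/_cat/shards?v' | grep -v STARTED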

The contents of the log files:

Log file from the restarted active master (hdp-01-master):

[2017-06-13T10:45:40,661][INFO ][o.e.n.Node ] [hdp-01-master] stopping ...
[2017-06-13T10:45:40,690][INFO ][o.e.n.Node ] [hdp-01-master] stopped
[2017-06-13T10:45:40,690][INFO ][o.e.n.Node ] [hdp-01-master] closing ...
[2017-06-13T10:45:40,707][INFO ][o.e.n.Node ] [hdp-01-master] closed
[2017-06-13T10:47:37,016][INFO ][o.e.n.Node ] [hdp-01-master] initializing ...
[2017-06-13T10:47:37,113][INFO ][o.e.e.NodeEnvironment ] [hdp-01-master] using [1] data paths, mounts [[/elasticsearch (/dev/mapper/mpatha1)]], net usable_space [3.9tb], net total_space [4.3tb], spins? [possibly], types [ext4]
[2017-06-13T10:47:37,113][INFO ][o.e.e.NodeEnvironment ] [hdp-01-master] heap size [30.7gb], compressed ordinary object pointers [true]
[2017-06-13T10:47:37,189][INFO ][o.e.n.Node ] [hdp-01-master] node name [hdp-01-master], node ID [8YNfaQMsSoOdfuFmiic5NQ]
[2017-06-13T10:47:37,189][INFO ][o.e.n.Node ] [hdp-01-master] version[5.4.0], pid[12036], build[780f8c4/2017-04-28T17:43:27.229Z], OS[Linux/3.10.0-514.21.1.el7.x86_64/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/1.8.0_131/25.131-b12]
[2017-06-13T10:47:38,017][INFO ][o.e.p.PluginsService ] [hdp-01-master] loaded module [aggs-matrix-stats]
[2017-06-13T10:47:38,017][INFO ][o.e.p.PluginsService ] [hdp-01-master] loaded module [ingest-common]
[2017-06-13T10:47:38,017][INFO ][o.e.p.PluginsService ] [hdp-01-master] loaded module [lang-expression]
[2017-06-13T10:47:38,017][INFO ][o.e.p.PluginsService ] [hdp-01-master] loaded module [lang-groovy]
[2017-06-13T10:47:38,017][INFO ][o.e.p.PluginsService ] [hdp-01-master] loaded module [lang-mustache]
[2017-06-13T10:47:38,018][INFO ][o.e.p.PluginsService ] [hdp-01-master] loaded module [lang-painless]
[2017-06-13T10:47:38,018][INFO ][o.e.p.PluginsService ] [hdp-01-master] loaded module [percolator]
[2017-06-13T10:47:38,018][INFO ][o.e.p.PluginsService ] [hdp-01-master] loaded module [reindex]
[2017-06-13T10:47:38,018][INFO ][o.e.p.PluginsService ] [hdp-01-master] loaded module [transport-netty3]
[2017-06-13T10:47:38,018][INFO ][o.e.p.PluginsService ] [hdp-01-master] loaded module [transport-netty4]
[2017-06-13T10:47:38,019][INFO ][o.e.p.PluginsService ] [hdp-01-master] no plugins loaded
[2017-06-13T10:47:43,155][INFO ][o.e.d.DiscoveryModule ] [hdp-01-master] using discovery type [zen]
[2017-06-13T10:47:44,500][INFO ][o.e.n.Node ] [hdp-01-master] initialized
[2017-06-13T10:47:44,500][INFO ][o.e.n.Node ] [hdp-01-master] starting ...
[2017-06-13T10:47:44,678][INFO ][o.e.t.TransportService ] [hdp-01-master] publish_address {10.7.10.212:9300}, bound_addresses {10.7.10.212:9300}, {[::1]:9300}, {127.0.0.1:9300}, {[fe80::42f2:e9ff:fec3:6358]:9300}
[2017-06-13T10:47:44,685][INFO ][o.e.b.BootstrapChecks ] [hdp-01-master] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2017-06-13T10:47:58,971][INFO ][o.e.c.s.ClusterService ] [hdp-01-master] detected_master {hdp-02-master}{pxnoVS7JThaFx97fZ7CP3g}{flTcV3ijQsaf1tZ_t2-TpQ}{10.7.10.214}{10.7.10.214:9300}, added {{hdp-03-data-01}{reY2QtW7SLGqRrBWBzIzxQ}{5FdcgCdsSqeYPsoeN
scMTg}{10.7.10.216}{10.7.10.216:9301},{hdp-06-data-01}{7xt5lYsPQIC1q7xZu1dWBQ}{N8DvE1lCSKyh_7emtJblzA}{10.7.10.219}{10.7.10.219:9301},{hdp-04-data-01}{1twQnDaYQc6d3rNXROd1SA}{rtFhFHqjQhGToQKKIhTkVg}{10.7.10.217}{10.7.10.217:9301},{hdp-05-data-02}{_b
ppTnQ6SC6YjOzY0TvVBw}{i6Nl_GviRXK-aOem-Tim8g}{10.7.10.218}{10.7.10.218:9302},{hdp-07-data-01}{w91AB0WcQcuVLfslgL1Urw}{qLnVUwRQRSWJ3JCpvdLxrQ}{10.7.10.220}{10.7.10.220:9301},{hdp-06-data-02}{2BJg1e3eRgmf32l8XPGGIA}{P2AUqUY_S3eP7dRPAFN-Sg}{10.7.10.219}{10.4
.108.219:9302},{hdp-02-master}{pxnoVS7JThaFx97fZ7CP3g}{flTcV3ijQsaf1tZ_t2-TpQ}{10.7.10.214}{10.7.10.214:9300},{hdp-04-data-02}{mWFL6GKOS4yrPIctN-xbjw}{2kEW8a2_SFaW9Ysp6PRhiw}{10.7.10.217}{10.7.10.217:9302},{hdp-07-data-02}{ROH4DptMThCatsgr0QMtrQ}{wcVz
VTQjRGi6QDQy_ywp1A}{10.7.10.220}{10.7.10.220:9302},{hdp-02-data-01}{hjKZgaWwTwmjJQvWifkCtQ}{K4u8608QSUi8Rh0PzL4Y-A}{10.7.10.214}{10.7.10.214:9301},{hdp-01-data-01}{15gD8dOhSW267pERz_tDxw}{z3ZEAibTRd67dXGjFEx5Gw}{10.7.10.212}{10.7.10.212:9301},{hdp-0
3-master}{vvgvw0pUQau1Wo_GwPqCTw}{HxQzGLEQSDWXXSOQAq2A2Q}{10.7.10.216}{10.7.10.216:9300},{hdp-05-data-01}{WwypVa1qQpGsB_FyNuSUTg}{H-a9nDgoQCeYeYuNBdy4PQ}{10.7.10.218}{10.7.10.218:9301},}, reason: zen-disco-receive(from master [master {hdp-02-master}{pxnoVS
7JThaFx97fZ7CP3g}{flTcV3ijQsaf1tZ_t2-TpQ}{10.7.10.214}{10.7.10.214:9300} committed version [1120]])
[2017-06-13T10:47:59,137][INFO ][o.e.h.n.Netty4HttpServerTransport] [hdp-01-master] publish_address {10.7.10.212:9200}, bound_addresses {10.7.10.212:9200}, {[::1]:9200}, {127.0.0.1:9200}, {[fe80::42f2:e9ff:fec3:6358]:9200}
[2017-06-13T10:47:59,142][INFO ][o.e.n.Node ] [hdp-01-master] started
[2017-06-13T10:47:59,456][DEBUG][o.e.a.s.TransportSearchAction] [hdp-01-master] All shards failed for phase: [query]
[2017-06-13T10:47:59,456][DEBUG][o.e.a.s.TransportSearchAction] [hdp-01-master] All shards failed for phase: [query]
[2017-06-13T10:47:59,456][DEBUG][o.e.a.s.TransportSearchAction] [hdp-01-master] All shards failed for phase: [query]
[2017-06-13T10:47:59,458][WARN ][r.suppressed ] path: /vk_20170613/_search, params: {scroll=10m, index=vk_20170613}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed

Log file from the standby master (hdp-02-master):

[2017-06-13T10:45:40,661][INFO ][o.e.d.z.ZenDiscovery ] [hdp-02-master] master_left [{hdp-01-master}{8YNfaQMsSoOdfuFmiic5NQ}{T7jwtU_bScyJ_tXs2aHojw}{10.7.10.212}{10.7.10.212:9300}], reason [shut_down]
[2017-06-13T10:45:40,664][WARN ][o.e.d.z.ZenDiscovery ] [hdp-02-master] master left (reason = shut_down), current nodes: nodes:
{hdp-06-data-01}{7xt5lYsPQIC1q7xZu1dWBQ}{N8DvE1lCSKyh_7emtJblzA}{10.7.10.219}{10.7.10.219:9301}
{hdp-04-data-01}{1twQnDaYQc6d3rNXROd1SA}{rtFhFHqjQhGToQKKIhTkVg}{10.7.10.217}{10.7.10.217:9301}
{hdp-05-data-02}{_bppTnQ6SC6YjOzY0TvVBw}{i6Nl_GviRXK-aOem-Tim8g}{10.7.10.218}{10.7.10.218:9302}
{hdp-07-data-01}{w91AB0WcQcuVLfslgL1Urw}{qLnVUwRQRSWJ3JCpvdLxrQ}{10.7.10.220}{10.7.10.220:9301}
{hdp-04-data-02}{mWFL6GKOS4yrPIctN-xbjw}{2kEW8a2_SFaW9Ysp6PRhiw}{10.7.10.217}{10.7.10.217:9302}
{hdp-01-master}{8YNfaQMsSoOdfuFmiic5NQ}{T7jwtU_bScyJ_tXs2aHojw}{10.7.10.212}{10.7.10.212:9300}, master
{hdp-05-data-01}{WwypVa1qQpGsB_FyNuSUTg}{H-a9nDgoQCeYeYuNBdy4PQ}{10.7.10.218}{10.7.10.218:9301}
{hdp-02-data-01}{hjKZgaWwTwmjJQvWifkCtQ}{K4u8608QSUi8Rh0PzL4Y-A}{10.7.10.214}{10.7.10.214:9301}
{hdp-07-data-02}{ROH4DptMThCatsgr0QMtrQ}{wcVzVTQjRGi6QDQy_ywp1A}{10.7.10.220}{10.7.10.220:9302}
{hdp-01-data-01}{15gD8dOhSW267pERz_tDxw}{z3ZEAibTRd67dXGjFEx5Gw}{10.7.10.212}{10.7.10.212:9301}
{hdp-03-data-01}{reY2QtW7SLGqRrBWBzIzxQ}{5FdcgCdsSqeYPsoeNscMTg}{10.7.10.216}{10.7.10.216:9301}
{hdp-06-data-02}{2BJg1e3eRgmf32l8XPGGIA}{P2AUqUY_S3eP7dRPAFN-Sg}{10.7.10.219}{10.7.10.219:9302}
{hdp-03-master}{vvgvw0pUQau1Wo_GwPqCTw}{HxQzGLEQSDWXXSOQAq2A2Q}{10.7.10.216}{10.7.10.216:9300}
{hdp-02-master}{pxnoVS7JThaFx97fZ7CP3g}{flTcV3ijQsaf1tZ_t2-TpQ}{10.7.10.214}{10.7.10.214:9300}, local
[2017-06-13T10:45:43,673][WARN ][o.e.d.z.ZenDiscovery ] [hdp-02-master] failed to connect to master [{hdp-01-master}{8YNfaQMsSoOdfuFmiic5NQ}{T7jwtU_bScyJ_tXs2aHojw}{10.7.10.212}{10.7.10.212:9300}], retrying...
org.elasticsearch.transport.ConnectTransportException: [hdp-01-master][10.7.10.212:9300] connect_timeout[30s]
at org.elasticsearch.transport.netty4.Netty4Transport.connectToChannels(Netty4Transport.java:359) ~[?:?]
at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:526) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:465) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:315) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:302) ~[elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.discovery.zen.ZenDiscovery.joinElectedMaster(ZenDiscovery.java:468) [elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.discovery.zen.ZenDiscovery.innerJoinCluster(ZenDiscovery.java:420) [elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.discovery.zen.ZenDiscovery.access$4100(ZenDiscovery.java:83) [elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.discovery.zen.ZenDiscovery$JoinThreadControl$1.run(ZenDiscovery.java:1197) [elasticsearch-5.4.0.jar:5.4.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.4.0.jar:5.4.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 10.7.10.212/10.7.10.212:9300
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:352) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
[2017-06-13T10:45:46,755][INFO ][o.e.c.s.ClusterService ] [hdp-02-master] new_master {hdp-02-master}{pxnoVS7JThaFx97fZ7CP3g}{flTcV3ijQsaf1tZ_t2-TpQ}{10.7.10.214}{10.7.10.214:9300}, reason: zen-disco-elected-as-master ([1] nodes joined)[{hdp-03-master}
{vvgvw0pUQau1Wo_GwPqCTw}{HxQzGLEQSDWXXSOQAq2A2Q}{10.7.10.216}{10.7.10.216:9300}]
[2017-06-13T10:45:46,769][WARN ][o.e.c.a.s.ShardStateAction] [hdp-02-master] [vk_20170607_merge][1] received shard failed for shard id [[vk_20170607_merge][1]], allocation id [zk-Ez1c7QZmqhbJFeV8BUw], primary term [0], message [master marked shard as active, but shard has not been created, mark shard as failed]
[2017-06-13T10:45:46,769][WARN ][o.e.c.a.s.ShardStateAction] [hdp-02-master] [vk_20170607_merge][0] received shard failed for shard id [[vk_20170607_merge][0]], allocation id [XaZTO1rEQlCLdwCrXbJzxw], primary term [0], message [master marked shard as active, but shard has not been created, mark shard as failed]
[2017-06-13T10:45:46,769][WARN ][o.e.c.a.s.ShardStateAction] [hdp-02-master] [buzz_20170607_merge][6] received shard failed for shard id [[buzz_20170607_merge][6]], allocation id [rXf2lzH7QK2y8_eS9UC2hA], primary term [0], message [master marked shard as active, but shard has not been created, mark shard as failed]
[2017-06-13T10:45:46,769][WARN ][o.e.c.a.s.ShardStateAction] [hdp-02-master] [vk_20170604_merge][3] received shard failed for shard id [[vk_20170604_merge][3]], allocation id [icTS5DpwSem22HNApST6cg], primary term [0], message [master marked shard as active, but shard has not been created, mark shard as failed]
[2017-06-13T10:45:46,770][WARN ][o.e.c.a.s.ShardStateAction] [hdp-02-master] [vk_20170611][7] received shard failed for shard id [[vk_20170611][7]], allocation id [L_i6lzDoRnSdx3ZLIzo2-w], primary term [0], message [master marked shard as active, but shard has not been created, mark shard as failed]

Further down in the log file there are "received shard failed for shard id" entries for every shard in the cluster.

no_master_block being all means that once the master disappears the cluster cannot do anything, including allocating shards.
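
If the goal is to keep the cluster readable while there is no elected master, the block can be relaxed in elasticsearch.yml (a sketch; write is the default level and rejects writes while still allowing reads):

discovery.zen.no_master_block: write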

Thanks for your reply!

The documentation says about discovery.zen.no_master_block: all:
All operations on the node—i.e. both read & writes—will be rejected. This also applies for api cluster state read or write operations, like the get index settings, put mapping and cluster state api.

Does that mean that shard allocation counts as one of those cluster state API operations?
Do I understand correctly that if we switch this setting to "write", the cluster will not go RED when the active master is shut down (assuming the other two master nodes are still running)?
