This is the output of GET /_cat/nodes?v when all the nodes are running:
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
172.22.107.20 2 54 9 1.48 0.68 0.33 mdi - 172.22.107.20:10000
172.22.107.22 3 65 9 0.15 0.20 0.17 mdi - 172.22.107.22:10000
172.22.107.21 2 54 11 0.30 0.26 0.14 mdi * 172.22.107.21:10000
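(As a quick aside on reading this output: the elected master is the row whose `master` column shows `*`. A small sketch that pulls it out of the sample above — the `nodes.txt` file name is just for illustration:)

```shell
# Save the _cat/nodes output shown above to a file (illustrative only).
cat <<'EOF' > nodes.txt
ip            heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
172.22.107.20            2          54   9    1.48    0.68     0.33 mdi       -      172.22.107.20:10000
172.22.107.22            3          65   9    0.15    0.20     0.17 mdi       -      172.22.107.22:10000
172.22.107.21            2          54  11    0.30    0.26     0.14 mdi       *      172.22.107.21:10000
EOF

# Column 9 is "master"; "*" marks the elected master node.
awk '$9 == "*" { print $1 }' nodes.txt   # prints 172.22.107.21
```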
And this is the output after the master is killed:
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
172.22.107.20 1 54 5 0.88 0.69 0.36 mdi - 172.22.107.20:10000
172.22.107.22 2 65 17 0.37 0.22 0.18 mdi * 172.22.107.22:10000
The nodes still respond to the command GET /_cat/nodes?v,
but there is no response to:
curl -XGET 'localhost:9200/_cluster/health?pretty'
Also, after restarting the killed master, it never joins the cluster. This is the log of the killed master after restarting:
[2018-06-07T09:56:40,416][INFO ][o.e.d.DiscoveryModule ] [172.22.107.21:10000] using discovery type [zen]
[2018-06-07T09:56:41,644][INFO ][o.e.n.Node ] [172.22.107.21:10000] initialized
[2018-06-07T09:56:41,644][INFO ][o.e.n.Node ] [172.22.107.21:10000] starting ...
[2018-06-07T09:56:41,835][INFO ][o.e.t.TransportService ] [172.22.107.21:10000] publish_address {172.22.107.21:9300}, bound_addresses {[::]:9300}
[2018-06-07T09:56:41,859][INFO ][o.e.b.BootstrapChecks ] [172.22.107.21:10000] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2018-06-07T09:56:45,984][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.21:10000] no known master node, scheduling a retry
[2018-06-07T09:57:11,907][WARN ][o.e.n.Node ] [172.22.107.21:10000] timed out while waiting for initial discovery state - timeout: 30s
[2018-06-07T09:57:11,920][INFO ][o.e.h.n.Netty4HttpServerTransport] [172.22.107.21:10000] publish_address {172.22.107.21:9200}, bound_addresses {[::]:9200}
[2018-06-07T09:57:11,920][INFO ][o.e.n.Node ] [172.22.107.21:10000] started
[2018-06-07T09:57:15,988][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.21:10000] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
[2018-06-07T09:57:16,027][DEBUG][o.e.a.a.i.e.i.TransportIndicesExistsAction] [172.22.107.21:10000] no known master node, scheduling a retry
[2018-06-07T09:57:45,018][INFO ][o.e.d.z.ZenDiscovery ] [172.22.107.21:10000] failed to send join request to master [{172.22.107.22:10000}{ds3DRQbwR2qQg7S9x-ljfw}{CTegKdEiQh2i2olrKdibrg}{172.22.107.22}{172.22.107.22:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
[2018-06-07T09:57:46,029][DEBUG][o.e.a.a.i.e.i.TransportIndicesExistsAction] [172.22.107.21:10000] timed out while retrying [indices:admin/exists] after failure (timeout [30s])
[2018-06-07T09:57:51,042][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.21:10000] no known master node, scheduling a retry
[2018-06-07T09:58:10,074][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.21:10000] no known master node, scheduling a retry
[2018-06-07T09:58:21,044][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.21:10000] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
[2018-06-07T09:58:21,058][DEBUG][o.e.a.a.i.e.i.TransportIndicesExistsAction] [172.22.107.21:10000] no known master node, scheduling a retry
[2018-06-07T09:58:40,076][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.21:10000] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
[2018-06-07T09:58:40,080][WARN ][r.suppressed ] path: /_cluster/health, params: {pretty=}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:209) [elasticsearch-5.6.7.jar:5.6.7]
at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:311) [elasticsearch-5.6.7.jar:5.6.7]
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:238) [elasticsearch-5.6.7.jar:5.6.7]
at org.elasticsearch.cluster.service.ClusterService$NotifyTimeout.run(ClusterService.java:1056) [elasticsearch-5.6.7.jar:5.6.7]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.6.7.jar:5.6.7]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_162]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_162]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_162]
[2018-06-07T09:58:48,030][INFO ][o.e.d.z.ZenDiscovery ] [172.22.107.21:10000] failed to send join request to master [{172.22.107.22:10000}{ds3DRQbwR2qQg7S9x-ljfw}{CTegKdEiQh2i2olrKdibrg}{172.22.107.22}{172.22.107.22:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
[2018-06-07T09:58:51,059][DEBUG][o.e.a.a.i.e.i.TransportIndicesExistsAction] [172.22.107.21:10000] timed out while retrying [indices:admin/exists] after failure (timeout [30s])
[2018-06-07T09:58:56,082][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.21:10000] no known master node, scheduling a retry
[2018-06-07T09:59:26,083][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.21:10000] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
[2018-06-07T09:59:26,093][DEBUG][o.e.a.a.i.e.i.TransportIndicesExistsAction] [172.22.107.21:10000] no known master node, scheduling a retry
[2018-06-07T09:59:51,040][INFO ][o.e.d.z.ZenDiscovery ] [172.22.107.21:10000] failed to send join request to master [{172.22.107.22:10000}{ds3DRQbwR2qQg7S9x-ljfw}{CTegKdEiQh2i2olrKdibrg}{172.22.107.22}{172.22.107.22:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
[2018-06-07T09:59:56,094][DEBUG][o.e.a.a.i.e.i.TransportIndicesExistsAction] [172.22.107.21:10000] timed out while retrying [indices:admin/exists] after failure (timeout [30s])
[2018-06-07T10:00:01,109][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.21:10000] no known master node, scheduling a retry
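(For context: the repeated "failed to send join request to master" and "no known master node" messages come from zen discovery on 5.x, where master election and node joins depend on the configured quorum. A minimal elasticsearch.yml sketch of the settings I would expect for this cluster — the host list is assumed from the IPs in the logs, not taken from my actual config:)

```yaml
# Assumed settings for a 3-node cluster with the IPs seen in the logs above.
# With 3 master-eligible nodes, the quorum is (3 / 2) + 1 = 2, so a master
# can still be elected after one node is killed.
discovery.zen.ping.unicast.hosts: ["172.22.107.20", "172.22.107.21", "172.22.107.22"]
discovery.zen.minimum_master_nodes: 2
```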