Could not get cluster status after master node goes down

You seem to have a few non-default settings, e.g. gateway.recover_after_nodes and gateway.expected_nodes, that may not be appropriate for a 3-node cluster. How did you arrive at these (and other non-default) values?

As you have indices.ttl.interval in there, it looks like you have just reused the 1.7 config. I would recommend removing the non-default options and building the config back up from there, with discovery.zen.minimum_master_nodes and discovery.zen.ping.unicast.hosts set to appropriate values.
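
For a three-node cluster where all nodes are master-eligible, the relevant part of elasticsearch.yml would look roughly like this (just a sketch with placeholder names and addresses, not your actual config):

cluster.name: my-cluster
node.name: node-1
network.host: 0.0.0.0
discovery.zen.ping.unicast.hosts: ["10.0.0.1:9300", "10.0.0.2:9300", "10.0.0.3:9300"]
discovery.zen.minimum_master_nodes: 2

With three master-eligible nodes, minimum_master_nodes should be 2 ((3 / 2) + 1) so a single surviving node can never elect itself master.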

Hi Christian,

Thank you for the helpful hint. After removing "discovery.zen.fd.ping_timeout", "discovery.zen.fd.ping_retries" and "discovery.zen.ping_timeout", the situation improved a lot. I don't know what is going on with ES when these three are set.

Anyway, I could reproduce the problem consistently on my local machine. After removing these three settings from the ES configuration on the cluster of our test environment (on the customer site), the situation improved a lot, but sometimes the problem still occurs after stopping the master (localhost:9200/_cluster/health just hangs and does not respond), and after starting the crashed node again it does not join the other two nodes. I tried the same scenario even after removing all the non-default options as you mentioned. Still, the problem sometimes occurs.

There are 8100 shards in the cluster, distributed over three nodes, each with 8 GB of memory assigned to ES.

Cheers,
Vahid

That is in my opinion far too many shards for a cluster that size, which is probably contributing to the problems. Read this blog post for some guidance on shards and sharding.
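
To see how the shards are spread over the nodes you can use the cat allocation API, for example (assuming the HTTP port is reachable on localhost):

curl -XGET 'localhost:9200/_cat/allocation?v'

The first column shows how many shards each node currently holds.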

Yes, I know there are too many shards, but currently there is no easy way to avoid that. The problem is that all of this was working fine with the old version (1.7), and now we run into a big problem with the new version. Either we revert everything back to 1.7 (a serious problem with customers) or we restructure the data model, which costs a lot of work to change the code...

So for now I'm trying to somehow make it work in a reasonably stable way, and in parallel maybe find a better solution.

I've reduced the number of shards to 494 across all three nodes, following this: https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster

The ES heap size is set to 8 GB, so according to that guidance each node should hold at most about 8*20 = 160 shards. All the indices are almost empty: almost no search, no indexing...
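
For reference, the heap actually allocated per node can be double-checked with the cat nodes API, something like:

curl -XGET 'localhost:9200/_cat/nodes?v&h=name,heap.max,heap.percent'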

After stopping the master, a GET request to "localhost:9200/_cluster/health" on the other two nodes sometimes gets stuck and does not respond, and sometimes shows wrong information (number_of_data_nodes is 3 when it should be 2). Also, after starting the stopped node again, it does not join the cluster.

These are the ES logs on the restarting node (the old master):

22.107.22:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
[2018-06-05T14:25:59,993][DEBUG][o.e.a.a.i.e.i.TransportIndicesExistsAction] [172.22.107.20:10000] timed out while retrying [indices:admin/exists] after failure (timeout [30s])
[2018-06-05T14:26:05,009][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.20:10000] no known master node, scheduling a retry
[2018-06-05T14:26:35,010][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.20:10000] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
[2018-06-05T14:26:35,023][DEBUG][o.e.a.a.i.e.i.TransportIndicesExistsAction] [172.22.107.20:10000] no known master node, scheduling a retry
[2018-06-05T14:26:54,323][INFO ][o.e.d.z.ZenDiscovery     ] [172.22.107.20:10000] failed to send join request to master [{172.22.107.22:10000}{9-c6QHDEQymQo0bl-3hjxQ}{4y3uWlY6SMGAhe2rbeEE0w}{172.22.107.22}{172.22.107.22:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
[2018-06-05T14:27:05,024][DEBUG][o.e.a.a.i.e.i.TransportIndicesExistsAction] [172.22.107.20:10000] timed out while retrying [indices:admin/exists] after failure (timeout [30s])
[2018-06-05T14:27:10,037][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.20:10000] no known master node, scheduling a retry
[2018-06-05T14:27:40,038][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.20:10000] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
[2018-06-05T14:27:40,050][DEBUG][o.e.a.a.i.e.i.TransportIndicesExistsAction] [172.22.107.20:10000] no known master node, scheduling a retry
[2018-06-05T14:27:49,774][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.20:10000] no known master node, scheduling a retry
[2018-06-05T14:27:57,332][INFO ][o.e.d.z.ZenDiscovery     ] [172.22.107.20:10000] failed to send join request to master [{172.22.107.22:10000}{9-c6QHDEQymQo0bl-3hjxQ}{4y3uWlY6SMGAhe2rbeEE0w}{172.22.107.22}{172.22.107.22:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
[2018-06-05T14:28:10,051][DEBUG][o.e.a.a.i.e.i.TransportIndicesExistsAction] [172.22.107.20:10000] timed out while retrying [indices:admin/exists] after failure (timeout [30s])
[2018-06-05T14:28:15,065][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.20:10000] no known master node, scheduling a retry
[2018-06-05T14:28:19,775][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.20:10000] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
[2018-06-05T14:28:19,778][WARN ][r.suppressed             ] path: /_cluster/health, params: {}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
        at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:209) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:311) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:238) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.cluster.service.ClusterService$NotifyTimeout.run(ClusterService.java:1056) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.6.7.jar:5.6.7]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_162]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_162]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_162]

There are 32 GB of memory on each node, of which 14 GB is free.
Java version: 1.8.0_162

Why is that? It's a total waste of resources to have lots of empty shards...

We must save data for each customer separately in different indexes, otherwise our software won't be certified :frowning:

We must save data for each customer separately in different indexes

Fair enough. Maybe this strategy will require more data nodes then...
Note that you can use filtered aliases to virtually separate the data.
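
As an illustration only (the index, alias and field names here are made up, not taken from your setup), a filtered alias per customer on a shared index would look something like:

curl -XPOST 'localhost:9200/_aliases?pretty' -H 'Content-Type: application/json' -d '
{
  "actions": [
    {
      "add": {
        "index": "shared_data",
        "alias": "customer_a",
        "filter": { "term": { "customer_id": "a" } }
      }
    }
  ]
}'

Searches against the customer_a alias then only see that customer's documents, without needing a separate index per customer.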

I think it has nothing to do with the number of shards...

I've closed all the indices but one, still the same issue...
After stopping the master, executing "curl -XGET localhost:9200/_cluster/health?pretty" on the other two nodes just gets stuck and after about 30s returns "error 503 master_not_discovered_exception...".

I've reduced the number of shards to 100 on three nodes.

Now the cluster is faster, but the discovery problem is still there. After shutting down the master, the other two nodes keep trying to connect to the gone node and never hold a master election, although they are also master-eligible.

In the logs it says that the master has left, and it lists the available nodes:

[screenshot: master-left]

However, it keeps trying to connect to it and never gives up...

[screenshot: failed-to-connect-to-let-master]

Please don't post images of text as they are hardly readable and not searchable.

Instead, paste the text and format it with the </> icon. Check the preview window.

Could you also share your node settings and the logs of the master-eligible nodes?

I used formatting in previous comments, but I wanted to highlight some lines, and sometimes there is sensitive information in the logs which I want to hide in the images...

These are the node settings, the same for all three:

discovery.zen.ping.unicast.hosts: ["172.22.107.20:9300","172.22.107.21:9300","172.22.107.22:9300"]
cluster.name: <cluster-name>
discovery.zen.minimum_master_nodes: 2
http.port: 9200
path.data: /opt/bdm4_data
network.host: 0.0.0.0
node.name: 172.22.107.21:10000
transport.tcp.port: 9300
action.auto_create_index: false

This is the log of one of the master-eligible nodes; it is the same on both:

 [2018-06-06T17:23:39,029][INFO ][o.e.n.Node               ] [172.22.107.20:10000] started
[2018-06-06T17:24:35,455][INFO ][o.e.c.s.ClusterService   ] [172.22.107.20:10000] added {{172.22.107.21:10000}{b1rXMHeJTh2LfTN1klzHzA}{KCRr-g4WSyOrMO4XUJqwVg}{172.22.107.21}{172.22.107.21:9300},}, reason: zen-disco-receive(from master [master {172.22.107.22:10000}{ds3DRQbwR2qQg7S9x-ljfw}{yTY4XQFsReiv4dxEaT67cA}{172.22.107.22}{172.22.107.22:9300} committed version [46]])
[2018-06-06T17:25:14,044][INFO ][o.e.d.z.ZenDiscovery     ] [172.22.107.20:10000] master_left [{172.22.107.22:10000}{ds3DRQbwR2qQg7S9x-ljfw}{yTY4XQFsReiv4dxEaT67cA}{172.22.107.22}{172.22.107.22:9300}], reason [shut_down]
[2018-06-06T17:25:14,047][WARN ][o.e.d.z.ZenDiscovery     ] [172.22.107.20:10000] master left (reason = shut_down), current nodes: nodes:
   {172.22.107.20:10000}{w49SCPmcRvSsW0_948B_rA}{xsBLwTuoRx6gJ5ew9SzUAw}{172.22.107.20}{172.22.107.20:9300}, local
   {172.22.107.22:10000}{ds3DRQbwR2qQg7S9x-ljfw}{yTY4XQFsReiv4dxEaT67cA}{172.22.107.22}{172.22.107.22:9300}, master
   {172.22.107.21:10000}{b1rXMHeJTh2LfTN1klzHzA}{KCRr-g4WSyOrMO4XUJqwVg}{172.22.107.21}{172.22.107.21:9300}

[2018-06-06T17:25:14,478][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [172.22.107.20:10000] no known master node, scheduling a retry
[2018-06-06T17:25:14,870][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.20:10000] no known master node, scheduling a retry
[2018-06-06T17:25:16,545][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.20:10000] no known master node, scheduling a retry
[2018-06-06T17:25:17,160][INFO ][o.e.c.s.ClusterService   ] [172.22.107.20:10000] detected_master {172.22.107.21:10000}{b1rXMHeJTh2LfTN1klzHzA}{KCRr-g4WSyOrMO4XUJqwVg}{172.22.107.21}{172.22.107.21:9300}, reason: zen-disco-receive(from master [master {172.22.107.21:10000}{b1rXMHeJTh2LfTN1klzHzA}{KCRr-g4WSyOrMO4XUJqwVg}{172.22.107.21}{172.22.107.21:9300} committed version [77]])
[2018-06-06T17:25:17,170][WARN ][o.e.c.NodeConnectionsService] [172.22.107.20:10000] failed to connect to node {172.22.107.22:10000}{ds3DRQbwR2qQg7S9x-ljfw}{yTY4XQFsReiv4dxEaT67cA}{172.22.107.22}{172.22.107.22:9300} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [172.22.107.22:10000][172.22.107.22:9300] connect_timeout[30s]
        at org.elasticsearch.transport.netty4.Netty4Transport.connectToChannels(Netty4Transport.java:363) ~[?:?]
        at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:570) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:473) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:342) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:329) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:154) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.cluster.NodeConnectionsService$1.doRun(NodeConnectionsService.java:107) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.7.jar:5.6.7]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_162]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_162]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_162]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 172.22.107.22/172.22.107.22:9300
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:352) ~[?:?]
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) ~[?:?]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
[2018-06-06T19:10:15,616][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [172.22.107.20:10000] no known master node, scheduling a retry
[2018-06-06T19:10:15,770][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.20:10000] no known master node, scheduling a retry
[2018-06-06T19:10:17,710][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.20:10000] no known master node, scheduling a retry
[2018-06-06T19:10:18,393][WARN ][o.e.c.NodeConnectionsService] [172.22.107.20:10000] failed to connect to node {172.22.107.21:10000}{b1rXMHeJTh2LfTN1klzHzA}{sJRRQk0IS46vdZmIkMNbFA}{172.22.107.21}{172.22.107.21:9300} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [172.22.107.21:10000][172.22.107.21:9300] connect_timeout[30s]
        at org.elasticsearch.transport.netty4.Netty4Transport.connectToChannels(Netty4Transport.java:363) ~[?:?]
        at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:570) ~[elasticsearch-5.6.7.jar:5.6.7]

When all your nodes are running, could you run:

GET /_cat/nodes?v

Then kill the master node and run it again on one of the remaining nodes?

this is the output when all the nodes are running:

ip            heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
172.22.107.20            2          54   9    1.48    0.68     0.33 mdi       -      172.22.107.20:10000
172.22.107.22            3          65   9    0.15    0.20     0.17 mdi       -      172.22.107.22:10000
172.22.107.21            2          54  11    0.30    0.26     0.14 mdi       *      172.22.107.21:10000

and this is the output when master is killed:

ip            heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
172.22.107.20            1          54   5    0.88    0.69     0.36 mdi       -      172.22.107.20:10000
172.22.107.22            2          65  17    0.37    0.22     0.18 mdi       *      172.22.107.22:10000

The nodes respond to the command GET /_cat/nodes?v

but no response to

curl -XGET localhost:9200/_cluster/health?pretty

Also, after restarting the killed master, it never joins the cluster. This is the log of the killed master after restarting:

[2018-06-07T09:56:40,416][INFO ][o.e.d.DiscoveryModule    ] [172.22.107.21:10000] using discovery type [zen]
[2018-06-07T09:56:41,644][INFO ][o.e.n.Node               ] [172.22.107.21:10000] initialized
[2018-06-07T09:56:41,644][INFO ][o.e.n.Node               ] [172.22.107.21:10000] starting ...
[2018-06-07T09:56:41,835][INFO ][o.e.t.TransportService   ] [172.22.107.21:10000] publish_address {172.22.107.21:9300}, bound_addresses {[::]:9300}
[2018-06-07T09:56:41,859][INFO ][o.e.b.BootstrapChecks    ] [172.22.107.21:10000] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2018-06-07T09:56:45,984][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.21:10000] no known master node, scheduling a retry
[2018-06-07T09:57:11,907][WARN ][o.e.n.Node               ] [172.22.107.21:10000] timed out while waiting for initial discovery state - timeout: 30s
[2018-06-07T09:57:11,920][INFO ][o.e.h.n.Netty4HttpServerTransport] [172.22.107.21:10000] publish_address {172.22.107.21:9200}, bound_addresses {[::]:9200}
[2018-06-07T09:57:11,920][INFO ][o.e.n.Node               ] [172.22.107.21:10000] started
[2018-06-07T09:57:15,988][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.21:10000] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
[2018-06-07T09:57:16,027][DEBUG][o.e.a.a.i.e.i.TransportIndicesExistsAction] [172.22.107.21:10000] no known master node, scheduling a retry
[2018-06-07T09:57:45,018][INFO ][o.e.d.z.ZenDiscovery     ] [172.22.107.21:10000] failed to send join request to master [{172.22.107.22:10000}{ds3DRQbwR2qQg7S9x-ljfw}{CTegKdEiQh2i2olrKdibrg}{172.22.107.22}{172.22.107.22:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
[2018-06-07T09:57:46,029][DEBUG][o.e.a.a.i.e.i.TransportIndicesExistsAction] [172.22.107.21:10000] timed out while retrying [indices:admin/exists] after failure (timeout [30s])
[2018-06-07T09:57:51,042][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.21:10000] no known master node, scheduling a retry
[2018-06-07T09:58:10,074][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.21:10000] no known master node, scheduling a retry
[2018-06-07T09:58:21,044][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.21:10000] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
[2018-06-07T09:58:21,058][DEBUG][o.e.a.a.i.e.i.TransportIndicesExistsAction] [172.22.107.21:10000] no known master node, scheduling a retry
[2018-06-07T09:58:40,076][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.21:10000] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
[2018-06-07T09:58:40,080][WARN ][r.suppressed             ] path: /_cluster/health, params: {pretty=}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
        at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:209) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:311) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:238) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.cluster.service.ClusterService$NotifyTimeout.run(ClusterService.java:1056) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.6.7.jar:5.6.7]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_162]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_162]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_162]
[2018-06-07T09:58:48,030][INFO ][o.e.d.z.ZenDiscovery     ] [172.22.107.21:10000] failed to send join request to master [{172.22.107.22:10000}{ds3DRQbwR2qQg7S9x-ljfw}{CTegKdEiQh2i2olrKdibrg}{172.22.107.22}{172.22.107.22:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
[2018-06-07T09:58:51,059][DEBUG][o.e.a.a.i.e.i.TransportIndicesExistsAction] [172.22.107.21:10000] timed out while retrying [indices:admin/exists] after failure (timeout [30s])
[2018-06-07T09:58:56,082][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.21:10000] no known master node, scheduling a retry
[2018-06-07T09:59:26,083][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.21:10000] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
[2018-06-07T09:59:26,093][DEBUG][o.e.a.a.i.e.i.TransportIndicesExistsAction] [172.22.107.21:10000] no known master node, scheduling a retry
[2018-06-07T09:59:51,040][INFO ][o.e.d.z.ZenDiscovery     ] [172.22.107.21:10000] failed to send join request to master [{172.22.107.22:10000}{ds3DRQbwR2qQg7S9x-ljfw}{CTegKdEiQh2i2olrKdibrg}{172.22.107.22}{172.22.107.22:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
[2018-06-07T09:59:56,094][DEBUG][o.e.a.a.i.e.i.TransportIndicesExistsAction] [172.22.107.21:10000] timed out while retrying [indices:admin/exists] after failure (timeout [30s])
[2018-06-07T10:00:01,109][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.21:10000] no known master node, scheduling a retry

When the master node is killed (step 2), could you run this:

curl -XGET 172.22.107.22:9200/_cluster/health?pretty
curl -XGET 172.22.107.20:9200/_cluster/health?pretty

After node 172.22.107.21 is killed we can clearly see that master node is now 172.22.107.22 which looks fine.

BTW could you share the full logs of the 3 nodes?

Also, after restarting the killed master, it never joins the cluster.

I suspect something bad on the network side, like a firewall or something.
Maybe you have something which prevents node 172.22.107.21 from connecting to 172.22.107.22 on port 9300?

Could you check that? Including trying to telnet to that port from 172.22.107.21?
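
For example, run from 172.22.107.21 (nc is an alternative if telnet is not installed):

telnet 172.22.107.22 9300
nc -vz 172.22.107.22 9300

If the connection is refused or times out, a firewall or other network issue is the likely culprit.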

The command

curl -XGET 172.22.107.22:9200/_cluster/health?pretty

doesn't work, since for security reasons port 9200 is closed for other IPs.

Port 9300 is reachable from the other nodes; telnet responds.

It's not possible to upload txt files to the forum. I can only copy/paste the logs here, which is limited to 7000 characters.

These are the logs for node 172.22.107.22, which is the new master:

[2018-06-07T09:41:04,827][INFO ][o.e.p.PluginsService     ] [172.22.107.22:10000] loaded module [aggs-matrix-stats]
[2018-06-07T09:41:04,828][INFO ][o.e.p.PluginsService     ] [172.22.107.22:10000] loaded module [ingest-common]
[2018-06-07T09:41:04,828][INFO ][o.e.p.PluginsService     ] [172.22.107.22:10000] loaded module [lang-expression]
[2018-06-07T09:41:04,828][INFO ][o.e.p.PluginsService     ] [172.22.107.22:10000] loaded module [lang-groovy]
[2018-06-07T09:41:04,828][INFO ][o.e.p.PluginsService     ] [172.22.107.22:10000] loaded module [lang-mustache]
[2018-06-07T09:41:04,828][INFO ][o.e.p.PluginsService     ] [172.22.107.22:10000] loaded module [lang-painless]
[2018-06-07T09:41:04,828][INFO ][o.e.p.PluginsService     ] [172.22.107.22:10000] loaded module [parent-join]
[2018-06-07T09:41:04,828][INFO ][o.e.p.PluginsService     ] [172.22.107.22:10000] loaded module [percolator]
[2018-06-07T09:41:04,828][INFO ][o.e.p.PluginsService     ] [172.22.107.22:10000] loaded module [reindex]
[2018-06-07T09:41:04,829][INFO ][o.e.p.PluginsService     ] [172.22.107.22:10000] loaded module [transport-netty3]
[2018-06-07T09:41:04,829][INFO ][o.e.p.PluginsService     ] [172.22.107.22:10000] loaded module [transport-netty4]
[2018-06-07T09:41:04,829][INFO ][o.e.p.PluginsService     ] [172.22.107.22:10000] no plugins loaded
[2018-06-07T09:41:07,000][INFO ][o.e.d.DiscoveryModule    ] [172.22.107.22:10000] using discovery type [zen]
[2018-06-07T09:41:08,256][INFO ][o.e.n.Node               ] [172.22.107.22:10000] initialized
[2018-06-07T09:41:08,256][INFO ][o.e.n.Node               ] [172.22.107.22:10000] starting ...
[2018-06-07T09:41:08,481][INFO ][o.e.t.TransportService   ] [172.22.107.22:10000] publish_address {172.22.107.22:9300}, bound_addresses {[::]:9300}
[2018-06-07T09:41:08,502][INFO ][o.e.b.BootstrapChecks    ] [172.22.107.22:10000] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2018-06-07T09:41:12,023][INFO ][o.e.c.s.ClusterService   ] [172.22.107.22:10000] detected_master {172.22.107.21:10000}{b1rXMHeJTh2LfTN1klzHzA}{00vO7HZuQcq2LfLYZ-zQAg}{172.22.107.21}{172.22.107.21:9300}, added {{172.22.107.21:10000}{b1rXMHeJTh2LfTN1klzHzA}{00vO7HZuQcq2LfLYZ-zQAg}{172.22.107.21}{172.22.107.21:9300},}, reason: zen-disco-receive(from master [master {172.22.107.21:10000}{b1rXMHeJTh2LfTN1klzHzA}{00vO7HZuQcq2LfLYZ-zQAg}{172.22.107.21}{172.22.107.21:9300} committed version [1]])
[2018-06-07T09:41:12,057][INFO ][o.e.h.n.Netty4HttpServerTransport] [172.22.107.22:10000] publish_address {172.22.107.22:9200}, bound_addresses {[::]:9200}
[2018-06-07T09:41:12,058][INFO ][o.e.n.Node               ] [172.22.107.22:10000] started
[2018-06-07T09:42:37,811][INFO ][o.e.c.s.ClusterService   ] [172.22.107.22:10000] added {{172.22.107.20:10000}{w49SCPmcRvSsW0_948B_rA}{VnlkmgbtQuO793YmrfYDnA}{172.22.107.20}{172.22.107.20:9300},}, reason: zen-disco-receive(from master [master {172.22.107.21:10000}{b1rXMHeJTh2LfTN1klzHzA}{00vO7HZuQcq2LfLYZ-zQAg}{172.22.107.21}{172.22.107.21:9300} committed version [45]])
[2018-06-07T09:44:57,367][INFO ][o.e.d.z.ZenDiscovery     ] [172.22.107.22:10000] master_left [{172.22.107.21:10000}{b1rXMHeJTh2LfTN1klzHzA}{00vO7HZuQcq2LfLYZ-zQAg}{172.22.107.21}{172.22.107.21:9300}], reason [shut_down]
[2018-06-07T09:44:57,371][WARN ][o.e.d.z.ZenDiscovery     ] [172.22.107.22:10000] master left (reason = shut_down), current nodes: nodes:
   {172.22.107.20:10000}{w49SCPmcRvSsW0_948B_rA}{VnlkmgbtQuO793YmrfYDnA}{172.22.107.20}{172.22.107.20:9300}
   {172.22.107.22:10000}{ds3DRQbwR2qQg7S9x-ljfw}{CTegKdEiQh2i2olrKdibrg}{172.22.107.22}{172.22.107.22:9300}, local
   {172.22.107.21:10000}{b1rXMHeJTh2LfTN1klzHzA}{00vO7HZuQcq2LfLYZ-zQAg}{172.22.107.21}{172.22.107.21:9300}, master

[2018-06-07T09:44:57,556][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [172.22.107.22:10000] no known master node, scheduling a retry
[2018-06-07T09:44:58,284][WARN ][o.e.c.NodeConnectionsService] [172.22.107.22:10000] failed to connect to node {172.22.107.21:10000}{b1rXMHeJTh2LfTN1klzHzA}{00vO7HZuQcq2LfLYZ-zQAg}{172.22.107.21}{172.22.107.21:9300} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [172.22.107.21:10000][172.22.107.21:9300] connect_timeout[30s]
        at org.elasticsearch.transport.netty4.Netty4Transport.connectToChannels(Netty4Transport.java:363) ~[?:?]
        at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:570) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:473) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:342) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:329) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:154) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.cluster.NodeConnectionsService$ConnectionChecker.doRun(NodeConnectionsService.java:183) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.7.jar:5.6.7]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_162]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_162]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_162]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 172.22.107.21/172.22.107.21:9300
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:352) ~[?:?]
[2018-06-07T09:44:59,627][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.22:10000] no known master node, scheduling a retry
[2018-06-07T09:45:00,427][INFO ][o.e.c.s.ClusterService   ] [172.22.107.22:10000] new_master {172.22.107.22:10000}{ds3DRQbwR2qQg7S9x-ljfw}{CTegKdEiQh2i2olrKdibrg}{172.22.107.22}{172.22.107.22:9300}, reason: zen-disco-elected-as-master ([1] nodes joined)[{172.22.107.20:10000}{w49SCPmcRvSsW0_948B_rA}{VnlkmgbtQuO793YmrfYDnA}{172.22.107.20}{172.22.107.20:9300}]
[2018-06-07T09:45:00,486][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [172.22.107.22:10000] no known master node, scheduling a retry
[2018-06-07T09:45:00,488][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [172.22.107.22:10000] no known master node, scheduling a retry
[2018-06-07T09:45:00,499][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [172.22.107.22:10000] failed to execute on node [b1rXMHeJTh2LfTN1klzHzA]
org.elasticsearch.transport.NodeNotConnectedException: [172.22.107.21:10000][172.22.107.21:9300] Node not connected
        at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:640) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:117) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:540) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:516) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.start(TransportNodesAction.java:197) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.nodes.TransportNodesAction.doExecute(TransportNodesAction.java:89) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.nodes.TransportNodesAction.doExecute(TransportNodesAction.java:52) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:170) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:142) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:84) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:83) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:72) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:408) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.client.support.AbstractClient$ClusterAdmin.execute(AbstractClient.java:730) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.client.support.AbstractClient$ClusterAdmin.nodesStats(AbstractClient.java:826) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.cluster.InternalClusterInfoService.updateNodeStats(InternalClusterInfoService.java:256) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.cluster.InternalClusterInfoService.refresh(InternalClusterInfoService.java:292) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.cluster.InternalClusterInfoService.maybeRefresh(InternalClusterInfoService.java:277) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.cluster.InternalClusterInfoService.lambda$onMaster$0(InternalClusterInfoService.java:137) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.6.7.jar:5.6.7]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_162]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_162]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_162]
[2018-06-07T09:45:00,507][DEBUG][o.e.a.a.i.s.TransportIndicesStatsAction] [172.22.107.22:10000] failed to execute [indices:monitor/stats] on node [b1rXMHeJTh2LfTN1klzHzA]
org.elasticsearch.transport.NodeNotConnectedException: [172.22.107.21:10000][172.22.107.21:9300] Node not connected
        at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:640) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:117) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:540) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:503) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.sendNodeRequest(TransportBroadcastByNodeAction.java:322) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.start(TransportBroadcastByNodeAction.java:311) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction.doExecute(TransportBroadcastByNodeAction.java:234) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction.doExecute(TransportBroadcastByNodeAction.java:79) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:170) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:142) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:84) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:83) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:72) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:408) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.client.support.AbstractClient$IndicesAdmin.execute(AbstractClient.java:1256) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.client.support.AbstractClient$IndicesAdmin.stats(AbstractClient.java:1577) ~[elasticsearch-5.6.7.jar:5.6.7]
        at
[2018-06-07T09:45:00,705][INFO ][o.e.c.r.a.AllocationService] [172.22.107.22:10000] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{172.22.107.21:10000}{b1rXMHeJTh2LfTN1klzHzA}{00vO7HZuQcq2LfLYZ-zQAg}{172.22.107.21}{172.22.107.21:9300} transport disconnected]).
[2018-06-07T09:45:00,706][INFO ][o.e.c.s.ClusterService   ] [172.22.107.22:10000] removed {{172.22.107.21:10000}{b1rXMHeJTh2LfTN1klzHzA}{00vO7HZuQcq2LfLYZ-zQAg}{172.22.107.21}{172.22.107.21:9300},}, reason: zen-disco-node-failed({172.22.107.21:10000}{b1rXMHeJTh2LfTN1klzHzA}{00vO7HZuQcq2LfLYZ-zQAg}{172.22.107.21}{172.22.107.21:9300}), reason(transport disconnected)[{172.22.107.21:10000}{b1rXMHeJTh2LfTN1klzHzA}{00vO7HZuQcq2LfLYZ-zQAg}{172.22.107.21}{172.22.107.21:9300} transport disconnected]
[2018-06-07T09:45:01,092][WARN ][o.e.a.b.TransportShardBulkAction] [172.22.107.22:10000] [[logs][0]] failed to perform indices:data/write/bulk[s] on replica [logs][0], node[b1rXMHeJTh2LfTN1klzHzA], [R], s[STARTED], a[id=bMPR72sBT-SVI4zdX8dn0Q]
org.elasticsearch.transport.NodeNotConnectedException: [172.22.107.21:10000][172.22.107.21:9300] Node not connected
        at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:640) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:117) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:540) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:516) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1001) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.replication.ReplicationOperation.performOnReplica(ReplicationOperation.java:185) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.replication.ReplicationOperation.performOnReplicas(ReplicationOperation.java:169) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:129) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:345) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:270) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:924) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:921) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.index.shard.IndexShardOperationsLock.acquire(IndexShardOperationsLock.java:151) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationLock(IndexShard.java:1659) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryShardReference(TransportReplicationAction.java:933) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.replication.TransportReplicationAction.access$500(TransportReplicationAction.java:92) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:291) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:266) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:248) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:654) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.7.jar:5.6.7]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_162]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_162]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_162]
[2018-06-07T09:45:01,099][WARN ][o.e.c.a.s.ShardStateAction] [172.22.107.22:10000] [logs][0] received shard failed for shard id [[logs][0]], allocation id [bMPR72sBT-SVI4zdX8dn0Q], primary term [15], message [failed to perform indices:data/write/bulk[s] on replica [logs][0], node[b1rXMHeJTh2LfTN1klzHzA], [R], s[STARTED], a[id=bMPR72sBT-SVI4zdX8dn0Q]], failure [NodeNotConnectedException[[172.22.107.21:10000][172.22.107.21:9300] Node not connected]]
org.elasticsearch.transport.NodeNotConnectedException: [172.22.107.21:10000][172.22.107.21:9300] Node not connected
        at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:640) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:117) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:540) ~[elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:516) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1001) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.replication.ReplicationOperation.performOnReplica(ReplicationOperation.java:185) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.replication.ReplicationOperation.performOnReplicas(ReplicationOperation.java:169) [elasticsearch-5.6.7.jar:5.6.7]
        at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:129)