Part 2.
These are the corresponding master-node logs during this time frame, starting from the initial update of cluster.routing.allocation.enable to none.
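For context, allocation was disabled through the standard cluster settings API. The exact call is generated programmatically on our side, but it is equivalent to something like the following (using the HTTP port from the example elasticsearch.yml further down; ports may differ per node):

# Disable shard allocation before the rolling restart ("transient" here is an
# assumption; a persistent setting would behave the same way for this purpose).
curl -X PUT "localhost:27959/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}'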
[2019-10-10T17:56:59,070][INFO ][o.e.c.s.ClusterSettings ] [bpcw-node-0] updating [cluster.routing.allocation.enable] from [all] to [none]
[2019-10-10T17:57:06,536][INFO ][o.e.m.j.JvmGcMonitorService] [bpcw-node-0] [gc][15074] overhead, spent [265ms] collecting in the last [1s]
[2019-10-10T17:57:25,378][WARN ][o.e.i.f.SyncedFlushService] [bpcw-node-0] [logstash-2019.10.10][1] can't to issue sync id [1CCLb_xdRmWoaoeLUDsLUg] for out of sync replica [[logstash-2019.10.10][1], node[JOjvG-OgQB6CbLubR-7CMw], [R], s[STARTED], a[id=ZHYr3o0OQOKdmRpy0TGs1g]] with num docs [120375]; num docs on primary [120355]
[2019-10-10T18:00:24,536][WARN ][o.e.a.b.TransportShardBulkAction] [bpcw-node-0] [[filebeat-6.7.1-2019.10.10][0]] failed to perform indices:data/write/bulk[s] on replica [filebeat-6.7.1-2019.10.10][0], node[ox1nA_KUSdGhAYL0KtJYTA], [R], s[STARTED], a[id=_h20wRHyTlKTxC0UA_H8JQ]
org.elasticsearch.transport.NodeDisconnectedException: [bpcw-node-2][10.176.16.128:9300][indices:data/write/bulk[s][r]] disconnected
[2019-10-10T18:00:24,903][INFO ][o.e.c.s.MasterService ] [bpcw-node-0] zen-disco-node-left({bpcw-node-2}{ox1nA_KUSdGhAYL0KtJYTA}{4hD5t1ODQiipG5Ynwelf3Q}{10.176.16.128}{10.176.16.128:9300}), reason(left)[{bpcw-node-2}{ox1nA_KUSdGhAYL0KtJYTA}{4hD5t1ODQiipG5Ynwelf3Q}{10.176.16.128}{10.176.16.128:9300} left], reason: removed {{bpcw-node-2}{ox1nA_KUSdGhAYL0KtJYTA}{4hD5t1ODQiipG5Ynwelf3Q}{10.176.16.128}{10.176.16.128:9300},}
[2019-10-10T18:00:24,967][INFO ][o.e.c.s.ClusterApplierService] [bpcw-node-0] removed {{bpcw-node-2}{ox1nA_KUSdGhAYL0KtJYTA}{4hD5t1ODQiipG5Ynwelf3Q}{10.176.16.128}{10.176.16.128:9300},}, reason: apply cluster state (from master [master {bpcw-node-0}{mbL724baT0Keoq3KgbLo4w}{LN3f8aHRRuiBnCHjv3Ei9w}{10.176.16.115}{10.176.16.115:9300} committed version [16759] source [zen-disco-node-left({bpcw-node-2}{ox1nA_KUSdGhAYL0KtJYTA}{4hD5t1ODQiipG5Ynwelf3Q}{10.176.16.128}{10.176.16.128:9300}), reason(left)[{bpcw-node-2}{ox1nA_KUSdGhAYL0KtJYTA}{4hD5t1ODQiipG5Ynwelf3Q}{10.176.16.128}{10.176.16.128:9300} left]]])
[2019-10-10T18:00:24,988][INFO ][o.e.c.r.DelayedAllocationService] [bpcw-node-0] scheduling reroute for delayed shards in [59.4s] (342 delayed shards)
[2019-10-10T18:00:24,994][WARN ][o.e.c.r.a.AllocationService] [bpcw-node-0] [logstash-2019.10.10][2] marking unavailable shards as stale: [TK-zMYJcQbmW1caUGc1X_Q]
[2019-10-10T18:00:24,995][WARN ][o.e.c.r.a.AllocationService] [bpcw-node-0] [filebeat-6.7.1-2019.10.10][0] marking unavailable shards as stale: [_h20wRHyTlKTxC0UA_H8JQ]
[2019-10-10T18:00:24,995][WARN ][o.e.c.r.a.AllocationService] [bpcw-node-0] [filebeat-6.7.1-2019.10.10][2] marking unavailable shards as stale: [Deu2-x7_R9ul7cXJyeulIg]
[2019-10-10T18:00:25,311][INFO ][o.e.m.j.JvmGcMonitorService] [bpcw-node-0] [gc][15272] overhead, spent [339ms] collecting in the last [1s]
[2019-10-10T18:00:26,353][WARN ][o.e.t.OutboundHandler ] [bpcw-node-0] send message failed [channel: Netty4TcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/10.176.16.128:41960}]
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2019-10-10T18:00:27,359][WARN ][o.e.t.OutboundHandler ] [bpcw-node-0] send message failed [channel: Netty4TcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/10.176.16.128:41954}]
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2019-10-10T18:00:28,249][WARN ][o.e.c.r.a.AllocationService] [bpcw-node-0] [logstash-2019.10.10][0] marking unavailable shards as stale: [Yn4JSvWpR_CggQ5GqjsRww]
[2019-10-10T18:01:36,637][INFO ][o.e.c.s.MasterService ] [bpcw-node-0] zen-disco-node-join[{bpcw-node-2}{ox1nA_KUSdGhAYL0KtJYTA}{3AL0nv_STPK-YMLiGmBEQQ}{10.176.16.128}{10.176.16.128:9300}], reason: added {{bpcw-node-2}{ox1nA_KUSdGhAYL0KtJYTA}{3AL0nv_STPK-YMLiGmBEQQ}{10.176.16.128}{10.176.16.128:9300},}
[2019-10-10T18:02:06,677][WARN ][o.e.d.z.PublishClusterStateAction] [bpcw-node-0] timed out waiting for all nodes to process published state [16763] (timeout [30s], pending nodes: [{bpcw-node-2}{ox1nA_KUSdGhAYL0KtJYTA}{3AL0nv_STPK-YMLiGmBEQQ}{10.176.16.128}{10.176.16.128:9300}])
[2019-10-10T18:02:06,677][INFO ][o.e.c.s.ClusterApplierService] [bpcw-node-0] added {{bpcw-node-2}{ox1nA_KUSdGhAYL0KtJYTA}{3AL0nv_STPK-YMLiGmBEQQ}{10.176.16.128}{10.176.16.128:9300},}, reason: apply cluster state (from master [master {bpcw-node-0}{mbL724baT0Keoq3KgbLo4w}{LN3f8aHRRuiBnCHjv3Ei9w}{10.176.16.115}{10.176.16.115:9300} committed version [16763] source [zen-disco-node-join[{bpcw-node-2}{ox1nA_KUSdGhAYL0KtJYTA}{3AL0nv_STPK-YMLiGmBEQQ}{10.176.16.128}{10.176.16.128:9300}]]])
[2019-10-10T18:02:06,701][WARN ][o.e.c.s.MasterService ] [bpcw-node-0] cluster state update task [zen-disco-node-join[{bpcw-node-2}{ox1nA_KUSdGhAYL0KtJYTA}{3AL0nv_STPK-YMLiGmBEQQ}{10.176.16.128}{10.176.16.128:9300}]] took [30s] above the warn threshold of 30s
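A side note on the logs above: the "scheduling reroute for delayed shards in [59.4s] (342 delayed shards)" line reflects the default index.unassigned.node_left.delayed_timeout of 1m, and bpcw-node-2 only rejoined at 18:01:36, roughly 12 seconds after that window expired. If a restart is expected to take longer than a minute, the window can be widened per index with something like:

# Hypothetical example: give restarting nodes 5 minutes before their shard
# copies are reallocated elsewhere.
curl -X PUT "localhost:27959/_all/_settings" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "5m"
  }
}'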
After this I set cluster.routing.allocation.enable back to all (the same settings call as above, with "all"). All nodes register the change and there are no further errors. The cluster health then looks like this:
{
  "cluster_name": "bpcw-dev-cluster",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 3,
  "number_of_data_nodes": 3,
  "active_primary_shards": 513,
  "active_shards": 684,
  "relocating_shards": 0,
  "initializing_shards": 2,
  "unassigned_shards": 340,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 1,
  "number_of_in_flight_fetch": 954,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 66.66666666666666
}
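For reference, this is the standard cluster health output, i.e. what the following returns:

curl -X GET "localhost:27959/_cluster/health?pretty"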
The config files are built programmatically, but this is a representative elasticsearch.yml:
cluster.name: bpcw-dev-cluster
node.name: bpcw-node-2
network.host: 127.0.0.1
transport.host: 10.176.16.128
http.port: 27959
discovery.type: zen
discovery.zen.ping.unicast.hosts: elasticsearch-cluster.service.consul
discovery.zen.minimum_master_nodes: 2
discovery.zen.fd.ping_timeout: 121s
path.data: /usr/share/elasticsearch/data
The expected behavior is that there should be no initializing or unassigned shards, since the data directories live on persistent Portworx volumes and should be reusable when a node comes back. The cluster does eventually return to green, but the entire benefit of Portworx is lost.
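While the cluster is yellow, the allocation explain API is a useful way to see why a shard is still unassigned instead of reusing its on-disk copy; with no request body it reports on the first unassigned shard it finds:

# Ask the master why an unassigned shard has not been allocated.
curl -X GET "localhost:27959/_cluster/allocation/explain?pretty"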