I have a three-node cluster (nodes 1, 2 and 3) and am replacing one of the nodes, so I added node 4 and told the system not to allocate anything to node 3. It is now trying to move two shards off node 3. These are non-trivial in size (around 5 GB) and the link isn't fast (around 40 Mbps).
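For a sense of scale, here is a rough back-of-the-envelope estimate of how long one relocation would need on this link. It is only a sketch: it assumes each shard is about 5 GB (decimal units), that the whole 40 Mbps is available to the recovery, and that nothing is compressed or throttled. The function name is made up for illustration.

```python
def transfer_seconds(size_gb: float, link_mbps: float) -> float:
    """Ideal time to push size_gb gigabytes over a link_mbps link."""
    megabits = size_gb * 1000 * 8  # GB -> MB -> megabits (decimal units)
    return megabits / link_mbps


shard_gb = 5.0    # assumed size of one shard
link_mbps = 40.0  # approximate link speed

full = transfer_seconds(shard_gb, link_mbps)
print(f"full shard: {full:.0f} s (about {full / 60:.1f} min)")

# Rough elapsed time at the progress levels where the restarts show up.
for fraction in (0.20, 0.30):
    print(f"{fraction:.0%} done after roughly {full * fraction:.0f} s")
```

Under those assumptions a single shard needs on the order of a quarter of an hour of uninterrupted transfer, and each recovery is only surviving a few minutes before it starts over.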
What happens, as observed from
GET _cat/recovery?v&h=i,s,t,ty,st,shost,thost,f,fp,b,bp&s=st:desc&active_only=true
is that these shards make progress up to around 20-30% and then appear to restart, with progress dropping back to zero.
Over and over again. For hours and hours and hours and hours.
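To put a number on "over and over", the percentage column from repeated polls of the request above could be fed through something like the following. This is a minimal sketch: the sample readings are invented for illustration (they are not taken from my cluster), and the helper names and the one-point tolerance are arbitrary choices.

```python
from typing import Sequence


def parse_percent(value: str) -> float:
    """Convert a _cat-style percentage such as '23.4%' to a float."""
    return float(value.strip().rstrip("%"))


def count_restarts(samples: Sequence[str], tolerance: float = 1.0) -> int:
    """Count how often progress falls by more than `tolerance` points
    between two consecutive readings."""
    restarts = 0
    previous = None
    for raw in samples:
        current = parse_percent(raw)
        if previous is not None and current < previous - tolerance:
            restarts += 1
        previous = current
    return restarts


# Hypothetical bytes-percent readings for one shard, polled over time.
samples = ["0.0%", "11.2%", "24.9%", "0.0%", "9.8%", "27.3%", "1.4%"]
print(count_restarts(samples))
```

Logging a timestamp alongside each detected drop would make it easier to line the restarts up against the log entries on the sending node.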
Sometimes there is nothing at all in the sending node's log when this happens, but from time to time (not always in sync with the relocations failing) it logs things like:
[2018-10-26T15:56:07,264][WARN ][o.e.t.n.Netty4Transport ] [dev-monitor-3] send message failed [channel: NettyTcpChannel{localAddress=/172.31.11.57:9300, remoteAddress=/172.16.1.205:33542}]
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2018-10-26T15:56:07,428][INFO ][o.e.d.z.ZenDiscovery ] [dev-monitor-3] master_left [{dev-monitor-1}{M6uY-xHjQS250KMdYB2fHA}{MXc_8PjgREGaMdzN8bjwwQ}{172.16.2.38}{172.16.2.38:9300}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2018-10-26T15:56:07,428][WARN ][o.e.d.z.ZenDiscovery ] [dev-monitor-3] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes:
{dev-monitor-4}{heLU-gDkRXug-S3Cexx1vw}{aThXvdV3R7ORitXzw2BtSA}{172.16.2.64}{172.16.2.64:9300}{ml.machine_memory=33729298432, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}
{dev-monitor-3}{Bn33dAGxTseq4VPE48gr-A}{BZZS7yVFSHaKewRxkO_jyA}{172.31.11.57}{172.31.11.57:9300}, local
{dev-monitor-2}{oRJuzLBKRumqstfiLicChw}{b2YB24yEQDCxb5LELMwf_A}{172.16.1.205}{172.16.1.205:9300}
{dev-monitor-1}{M6uY-xHjQS250KMdYB2fHA}{MXc_8PjgREGaMdzN8bjwwQ}{172.16.2.38}{172.16.2.38:9300}, master
[2018-10-26T15:56:10,886][INFO ][o.e.c.s.ClusterApplierService] [dev-monitor-3] detected_master {dev-monitor-1}{M6uY-xHjQS250KMdYB2fHA}{MXc_8PjgREGaMdzN8bjwwQ}{172.16.2.38}{172.16.2.38:9300}, reason: apply cluster state (from master [master {dev-monitor-1}{M6uY-xHjQS250KMdYB2fHA}{MXc_8PjgREGaMdzN8bjwwQ}{172.16.2.38}{172.16.2.38:9300} committed version [5177850]])