Shard relocation keeps restarting?

Three nodes (1, 2, 3). I'm replacing one of them, so I added node 4 and told the cluster not to allocate anything to node 3. What it's now trying to do is move two shards off node 3; these are non-trivial in size (around 5GB) and the link isn't fast (around 40Mbps).
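For reference, the "don't allocate to 3" part was done with cluster-level allocation filtering, something along these lines (the exact attribute may differ, e.g. excluding by IP rather than node name):

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "dev-monitor-3"
  }
}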

What happens as observed from

GET _cat/recovery?v&h=i,s,t,ty,st,shost,thost,f,fp,b,bp&s=st:desc&active_only=true

is that these shards make progress up to around 20% - 30% and then appear to restart, with progress dropping back to zero.

Over and over again. For hours and hours and hours and hours.

Sometimes there's nothing at all in the sending node's log when this happens, but from time to time (not always in sync with the relocations failing) it says things like:

[2018-10-26T15:56:07,264][WARN ][o.e.t.n.Netty4Transport  ] [dev-monitor-3] send message failed [channel: NettyTcpChannel{localAddress=/172.31.11.57:9300, remoteAddress=/172.16.1.205:33542}]
java.nio.channels.ClosedChannelException: null
        at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2018-10-26T15:56:07,428][INFO ][o.e.d.z.ZenDiscovery     ] [dev-monitor-3] master_left [{dev-monitor-1}{M6uY-xHjQS250KMdYB2fHA}{MXc_8PjgREGaMdzN8bjwwQ}{172.16.2.38}{172.16.2.38:9300}], reason [failed to ping, tried [3] times, each with  maximum [30s] timeout]
[2018-10-26T15:56:07,428][WARN ][o.e.d.z.ZenDiscovery     ] [dev-monitor-3] master left (reason = failed to ping, tried [3] times, each with  maximum [30s] timeout), current nodes: nodes:
   {dev-monitor-4}{heLU-gDkRXug-S3Cexx1vw}{aThXvdV3R7ORitXzw2BtSA}{172.16.2.64}{172.16.2.64:9300}{ml.machine_memory=33729298432, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}
   {dev-monitor-3}{Bn33dAGxTseq4VPE48gr-A}{BZZS7yVFSHaKewRxkO_jyA}{172.31.11.57}{172.31.11.57:9300}, local
   {dev-monitor-2}{oRJuzLBKRumqstfiLicChw}{b2YB24yEQDCxb5LELMwf_A}{172.16.1.205}{172.16.1.205:9300}
   {dev-monitor-1}{M6uY-xHjQS250KMdYB2fHA}{MXc_8PjgREGaMdzN8bjwwQ}{172.16.2.38}{172.16.2.38:9300}, master

[2018-10-26T15:56:10,886][INFO ][o.e.c.s.ClusterApplierService] [dev-monitor-3] detected_master {dev-monitor-1}{M6uY-xHjQS250KMdYB2fHA}{MXc_8PjgREGaMdzN8bjwwQ}{172.16.2.38}{172.16.2.38:9300}, reason: apply cluster state (from master [master {dev-monitor-1}{M6uY-xHjQS250KMdYB2fHA}{MXc_8PjgREGaMdzN8bjwwQ}{172.16.2.38}{172.16.2.38:9300} committed version [5177850]])

This looks like a connectivity problem. dev-monitor-3 sent three consecutive pings to the master (dev-monitor-1) each of which received no response within 30 seconds. Also one of the channels between dev-monitor-3 and dev-monitor-2 was closed. I'd expect there to be messages in the master node's logs too, indicating that dev-monitor-3 temporarily left the cluster, which would cancel the ongoing recoveries.

It's possible the recovery is consuming the node's entire bandwidth and preventing higher-priority traffic like pings from getting through soon enough, particularly if there is a device with an excessively large buffer somewhere in the way. The default for indices.recovery.max_bytes_per_sec is 40MB/s (megabytes per second), so if you only have a 40Mbps (megabits per second) link then this alone could explain it. Try reducing indices.recovery.max_bytes_per_sec to something compatible with your network (e.g. 4mb == 4 megabytes per second == 32 megabits per second) and see if this gives more stability.
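For instance, a dynamic cluster settings update along these lines should apply the throttle (treat the 4mb value as a starting point and tune it to your link):

PUT _cluster/settings
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "4mb"
  }
}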

It's possible it's something else too, but this'd be my first guess.


Also, I'm curious, 40Mbps is a pretty narrow pipe these days. What's the story there? Are your nodes connected by satellite, for instance?

After the weekend it's still doing it. I'm going to accept the loss of those shards and close down the node I'm trying to get rid of.

I doubt it's bandwidth, as I reduced the number of concurrent recoveries to one and throttled the recovery to well under the link's capacity, at which point the single recovery got about twice as far (in percentage terms) as when I was running two at once. So I suspect it's a time limit of some sort, maybe a firewall dropping a TCP connection or something like that. I'm not intending to investigate further.
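For the record, the concurrency change was roughly this, with the bandwidth throttle applied via indices.recovery.max_bytes_per_sec as suggested above (the exact value isn't important beyond being well under the link's ~5MB/s):

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 1
  }
}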
