Elasticsearch 6.3.0 doesn't retry on index replica bulk write failure

(original discussion on GitHub: https://github.com/elastic/elasticsearch/issues/39259)

We are running a number of large-scale Elasticsearch clusters (40+ nodes, TBs of data, ingestion and search rates in the 100s-1000s per second) on ES version 6.3.0. When doing rolling upgrades of the individual nodes (which include a machine reboot) we have observed brief network disconnects of unrelated nodes. For example, if node1 is being rebooted, ES on node25 would throw socket-closed exceptions on connections unrelated to node1. As a direct consequence, a number of bulk indexing operations on node25 would fail on that node's shard replicas and the master would mark the corresponding shards as stale:

[2019-02-22T12:32:38,022][WARN ][o.e.c.r.a.AllocationService] [es-master-3.localdomain] failing shard [failed shard, shard [redacted][5], node[GLG7S_x_Sa25ey3vjmX_PA], [R], s[STARTED], a[id=3jAVAHveT1eAsf75zN4hmw], message [failed to perform indices:data/write/bulk[s] on replica [redacted][5], node[GLG7S_x_Sa25ey3vjmX_PA], [R], s[STARTED], a[id=3jAVAHveT1eAsf75zN4hmw]], failure [NodeNotConnectedException[[node25.localdomain][redacted:9300] Node not connected]], markAsStale [true]]

As discussed in the GitHub issue, we do not see any retries of those replica bulk write operations. It seems as if the node holding the primary shard tries to write to the replica once and then gives up. The end result is the loss of a number of replicas, a lot of shard rebalancing and a prolonged yellow cluster state.
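
For context, this is roughly how we observe the aftermath (hostname and port are placeholders; adjust for your own cluster):

    # Cluster health goes yellow while the failed replicas are reallocated and rebuilt
    curl -s 'http://localhost:9200/_cluster/health?pretty'

    # Lists each shard copy with its state and, for unassigned copies, the reason
    curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason'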

Is there a functional reason for the bulk replica writes not to be more robust? Giving up after a single attempt seems rather fragile.

Hi @andrejbl, thanks for continuing the discussion here.

A network disconnection occurs in one of the following situations:

  • the connection is dropped at the OS level after some number of unacknowledged retransmissions (15 by default on Linux)
  • the remote node repeatedly fails to respond to application-level health checks (3 times by default, and only on connections to or from the master node); the relevant defaults are sketched below
  • the connection is actively terminated by the remote node, for instance because it is shutting down

In all cases it makes sense to treat a network disconnection as a fatal error and not to retry further. The alternative is to queue up requests for the remote node in the hope that it reconnects soon, but this is normally not a helpful thing to do.
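
For reference, a minimal sketch of where those defaults live, assuming Linux and the Zen discovery fault-detection settings used by 6.x (verify against your own kernel and elasticsearch.yml rather than relying on the values shown here):

    # OS-level limit on unacknowledged retransmissions before the kernel drops the
    # connection (15 by default on most Linux distributions)
    sysctl net.ipv4.tcp_retries2

    # Application-level health checks between the elected master and the other nodes
    # (elasticsearch.yml, Zen discovery fault detection in 6.x, shown with their defaults)
    # discovery.zen.fd.ping_interval: 1s
    # discovery.zen.fd.ping_timeout: 30s
    # discovery.zen.fd.ping_retries: 3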

However, the disconnects you describe between nodes unrelated to the one being rebooted are unexpected and warrant further investigation, as I think they sit at the heart of the problem you were describing on GitHub. Can you share the logs surrounding this event? I'd like to see the logs from:

  • the elected master node
  • the rebooting node (node1 above)
  • the node with the exceptions (node25 above)
  • the peers from which it disconnected.

They're probably quite large, so perhaps use https://gist.github.com/. It would also be good to see the output of GET _nodes/stats to help correlate node IDs, IP addresses and so on.
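
In case it helps, a sketch of collecting that output with curl (hostname and port are placeholders; add authentication if your cluster requires it):

    # Full node stats, useful for correlating node IDs with names and IP addresses
    curl -s 'http://localhost:9200/_nodes/stats?pretty' > nodes_stats.json

    # A more compact view mapping full node IDs to names and addresses
    curl -s 'http://localhost:9200/_cat/nodes?v&full_id=true&h=id,name,ip'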
