(original discussion on github - https://github.com/elastic/elasticsearch/issues/39259)
We are running a number of large-scale Elasticsearch clusters (40+ nodes, terabytes of data, ingestion and search rates in the hundreds to thousands of operations per second) on ES version 6.3.0. When doing rolling upgrades of individual nodes (which include a machine reboot) we have observed brief network disconnects on unrelated nodes: e.g. while node1 is being rebooted, ES on node25 throws socket-closed exceptions on connections that have nothing to do with node1. As a direct impact, a number of bulk indexing operations on node25 fail on the node's shard replicas, and the master marks the corresponding shards as stale:
[2019-02-22T12:32:38,022][WARN ][o.e.c.r.a.AllocationService] [es-master-3.localdomain] failing shard [failed shard, shard [redacted][5], node[GLG7S_x_Sa25ey3vjmX_PA], [R], s[STARTED], a[id=3jAVAHveT1eAsf75zN4hmw], message [failed to perform indices:data/write/bulk[s] on replica [redacted][5], node[GLG7S_x_Sa25ey3vjmX_PA], [R], s[STARTED], a[id=3jAVAHveT1eAsf75zN4hmw]], failure [NodeNotConnectedException[[node25.localdomain][redacted:9300] Node not connected]], markAsStale [true]]
As discussed in the GitHub issue, we do not see any retries of those replica bulk write operations: the node holding the primary shard appears to attempt the replica write once and then give up. The end result is the loss of a number of replicas, a lot of shard rebalancing, and a prolonged yellow cluster state.
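In the meantime we work around the replica failures on the client side by retrying failed bulk actions ourselves with exponential backoff. A minimal sketch of that wrapper (names such as `bulk_fn` are placeholders for whatever issues the actual bulk request and returns the failed actions; this is not the ES client API itself):

```python
import random
import time

def bulk_with_retries(bulk_fn, actions, max_retries=5, base_delay=0.5):
    """Retry only the failed actions of a bulk call, with backoff.

    bulk_fn(actions) is assumed to return the list of actions that
    failed (an empty list means full success). This models the
    retry logic, not the Elasticsearch bulk API itself.
    """
    pending = list(actions)
    for attempt in range(max_retries):
        failed = bulk_fn(pending)
        if not failed:
            return []  # everything indexed
        pending = failed
        # Exponential backoff with jitter before retrying the failures,
        # so a brief node disconnect has time to heal.
        time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.0))
    return pending  # actions still failing after all retries
```

This papers over transient `NodeNotConnectedException`s during a reboot, but it only retries the whole bulk request from the coordinating side; it does not help with the replica-write path inside the cluster, which is what this issue is about.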
Is there a functional reason for the replica bulk writes not to be more robust? Giving up after a single attempt seems rather flaky.