We have an ES cluster with 10 data nodes and 3 master nodes. After a few days of data flowing into the cluster, we reach about 20,000 primary shards (60,000 shards in total, with 2 replicas per primary). Under these conditions, if we bring down a node or two for an upgrade or planned maintenance, roughly 6,000 shards become unassigned, and the cluster gets busy reallocating them. During that time our indexing APIs see a lot of timeouts, and we end up losing data.
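For scale, here is a rough sketch of the shard arithmetic. It assumes shards are spread evenly across the 10 data nodes, which is an approximation rather than measured data. One node holds about 6,000 shard copies, which matches the number we see go unassigned.

```python
# Back-of-the-envelope shard math; assumes an even spread of shards across data nodes.
PRIMARY_SHARDS = 20_000
REPLICAS_PER_PRIMARY = 2
DATA_NODES = 10

# Each primary has REPLICAS_PER_PRIMARY extra copies.
total_shards = PRIMARY_SHARDS * (1 + REPLICAS_PER_PRIMARY)
shards_per_node = total_shards // DATA_NODES


def unassigned_after_outage(nodes_down: int) -> int:
    """Shard copies left unassigned when `nodes_down` data nodes leave (even spread assumed)."""
    return shards_per_node * nodes_down


print(total_shards, shards_per_node, unassigned_after_outage(1), unassigned_after_outage(2))
```

With two nodes down, the same estimate roughly doubles to about 12,000 unassigned shard copies.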
Is there a cleaner way of handling the upgrade scenario without impacting indexing performance of the cluster?