When I add a new node to my Elasticsearch cluster, the cluster rebalances and shards move onto the new node. When I want to remove a node from the cluster, I force shards off the node before decommissioning it. During both of these operations, index requests are still being made to the cluster while shards are moving. Occasionally some shards take quite a long time to move (say, ~30-45 minutes), and indexing requests during this time are rejected with a ShardNotInPrimaryModeException that looks like the following:
java.lang.Exception: RemoteTransportException[[{NODE_NAME}][{NODE_IP}:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[{NODE_NAME}][{NODE_IP}:9300][indices:data/write/bulk[s][p]]]; nested: RetryOnPrimaryException[shard is not in primary mode]; nested: ShardNotInPrimaryModeException[CurrentState[STARTED] shard is not in primary mode];
The exception name implies that Elasticsearch is attempting to index to a shard that is not currently considered the primary. Perhaps when a primary shard is being relocated, its primary status changes?
I haven't found much documentation or discussion about this online. I briefly skimmed the ES source code where this exception gets thrown, and, as the name implies, it appears that some internal shard state in Elasticsearch is at odds with the indexing operation being attempted.
Is this due to an error on my end (i.e., a client error) or an error on the ES side (e.g., some flavor of an IllegalStateException)? Are there steps I should be taking to avoid or mitigate these errors? I'm also a bit surprised that it takes so long for some of these shards to move. The shards in the cluster vary in size, but the culprits seem to be around 30 GB, which is within the target size for a shard.
For reference, I'm running ES 7.6.2.
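For context, one common way to force shards off a node before decommissioning it is allocation filtering through the cluster settings API. The block below is a minimal sketch of that approach, not necessarily the exact call used here: it assumes exclusion by node name with a transient setting, and the helper names (build_drain_settings, drain_node), the base URL, and the node name are all hypothetical placeholders.

```python
import json
import urllib.request


def build_drain_settings(node_name):
    """Return a cluster-settings body that excludes one node from shard allocation.

    Assumption: the node is identified by name; _ip or _host filters also exist.
    """
    return {
        "transient": {
            "cluster.routing.allocation.exclude._name": node_name
        }
    }


def drain_node(base_url, node_name):
    """PUT the exclusion to /_cluster/settings and return the parsed response.

    base_url is a placeholder such as "http://localhost:9200".
    """
    body = json.dumps(build_drain_settings(node_name)).encode("utf-8")
    request = urllib.request.Request(
        url=base_url.rstrip("/") + "/_cluster/settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling drain_node("http://localhost:9200", "node-7") would send the request, after which the cluster starts relocating shards away from that node. Setting the same key back to null clears the exclusion.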
Thanks in advance!