I've got a few somewhat theoretical questions about replica shard promotion in the face of node failure. All questions are relative to versions 2.4.4 and 6.2+ (as my next upgrade will be from the former to the latter).
First, what log entries would indicate that a replica was promoted to primary, and what log level needs to be enabled to see them, on the master and/or on a data node? Are there other indications of promotion, successful or otherwise? So far, when I test node failure, I have yet to see any indication in the cluster logs of a successful replica promotion.
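For context, the only thing I've tried so far is bumping allocation-related loggers via the cluster settings API on a 6.x test cluster. I'm not certain these are the right logger names for promotion events, so treat this as a guess (and I believe 2.4 wanted the shorter `logger.cluster.routing.allocation` form without the package prefix):

```
PUT _cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.cluster.routing.allocation": "DEBUG",
    "logger.org.elasticsearch.cluster.action.shard": "DEBUG"
  }
}
```

Even with these at DEBUG I haven't spotted an obvious "promoted replica to primary" message, which is partly what prompted the question.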
Second, under what configuration conditions should promotion occur? What configuration directives are necessary and sufficient for promotion to occur?
Lastly, I read this recently so I'd like a bit more color here:
> An up-to-date shard copy of the data cannot be found on the current data nodes in the cluster. To prevent data loss, the system does not automatically promote a stale shard copy to primary.
What is the best way to identify whether I have a stale shard that cannot be promoted? Could slow cluster state updates/propagation cause shards to become stale, especially under load?
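For what it's worth, on 6.x I've been poking at the allocation explain API to see whether it reports a stale copy (as far as I know this API doesn't exist in 2.4; `my-index` below is just a placeholder for one of my indices):

```
GET _cluster/allocation/explain
{
  "index": "my-index",
  "shard": 0,
  "primary": true
}
```

My assumption is that the response's unassigned/allocation decisions would call out a stale copy if that's the blocker, but I'd appreciate confirmation that this is the right tool for the job.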
I'd love a bit more clarity around this situation, as searches through GitHub issues turned up few related to promotion. I'm troubleshooting a situation and trying to discern whether forcing the failure of a node in my test environment actually leads to successful replica promotion. I would expect my cluster status to turn RED for up to 30 seconds, not the ~120 seconds I actually observe (at which point the "failed" node has re-joined the cluster).
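During the failure test I've been watching shard states with something like the following (column names taken from the 6.x `_cat/shards` docs; I'm not sure `unassigned.reason` is available in 2.4):

```
GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason
```

If it matters, I haven't changed `index.unassigned.node_left.delayed_timeout` from its default, though my understanding is that delayed allocation only affects replica reallocation, not primary promotion. Happy to be corrected on that.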
Thank you in advance for any insight offered.