After some coordinating nodes left our Elasticsearch cluster and rejoined, we started seeing these errors in the logs:
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
And:
[2020-06-03T16:26:48,311][ERROR][o.e.x.s.a.e.NativeUsersStore] [elh-bhs-ovh017p] security index is unavailable. short circuiting retrieval of user [readall]
We have been trying to re-establish the cluster ever since, both with all nodes and with only the master nodes. We lowered some of the gateway settings in our configuration so that recovery would start sooner, but the result is the same.
It seems that even after a master is elected and some data nodes join the cluster, recovery does not start. Election retries keep happening even though a master already exists, and they fail non-stop.
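For anyone who wants to check the same symptom on their side, here is a rough sketch of how we look for the block from the exception above on a given node. It assumes the blocks are reported under `blocks.global` in the response of `GET /_cluster/state/blocks`, keyed by block id, and that the node answers on `http://localhost:9200` without authentication. Both are assumptions; with security enabled, as in our cluster, credentials would have to be added. The helper names are made up for illustration.

```python
import json
import urllib.request

# Block id taken from the exception text: SERVICE_UNAVAILABLE/1/state not recovered / initialized
STATE_NOT_RECOVERED_ID = "1"


def global_blocks(state: dict) -> dict:
    """Return the global blocks of a cluster state response, or {} if there are none."""
    return state.get("blocks", {}).get("global", {})


def state_not_recovered(state: dict) -> bool:
    """True if the 'state not recovered / initialized' block is still present."""
    block = global_blocks(state).get(STATE_NOT_RECOVERED_ID)
    return block is not None and "state not recovered" in block.get("description", "")


def fetch_blocks(base_url: str = "http://localhost:9200") -> dict:
    """Fetch only the blocks section of the cluster state from one node.

    base_url is a placeholder; local=true asks the node for its own view
    of the cluster state instead of the master's.
    """
    url = f"{base_url}/_cluster/state/blocks?local=true"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    state = fetch_blocks()
    if state_not_recovered(state):
        print("state not recovered block is still in place")
    else:
        print("no state not recovered block, global blocks:", global_blocks(state))
```

Running this against each master-eligible node in turn shows whether any of them has lifted the block after an election.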
Cluster stats, health and settings: https://gist.github.com/tiagokrebs/47e8387894c983b7533b46c62246176b/raw/6f4a9fb886aa6048ceadbf59b0e862ddd1a2ea57/stats
elasticsearch.yml: https://gist.github.com/tiagokrebs/47e8387894c983b7533b46c62246176b/raw/6f4a9fb886aa6048ceadbf59b0e862ddd1a2ea57/elasticsearch.yml
master 1 logs: https://gist.github.com/tiagokrebs/47e8387894c983b7533b46c62246176b/raw/6f4a9fb886aa6048ceadbf59b0e862ddd1a2ea57/master_1_log
master 2 logs: https://gist.github.com/tiagokrebs/47e8387894c983b7533b46c62246176b/raw/6f4a9fb886aa6048ceadbf59b0e862ddd1a2ea57/master_2_log
Any insight you can offer would be much appreciated.