We are running Elasticsearch 7.6.0 as a package from the DC/OS Universe (3 master nodes, 6 data nodes, 1 ingest node and 1 coordinator node) with X-Pack enabled (using the native realm). We ran into an issue after the DC/OS cluster was intentionally brought down for a couple of days: on restart, the Elasticsearch cluster came back up, but without the security index. To reproduce the issue, we replaced the master nodes (replacing a node deletes all of its persistent configuration). Once replaced, 2 of the master nodes came back up with this error -
[master-0-node] security index is unavailable. short circuiting retrieval of user
and the third master node logged this -
ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];]
We tried regenerating the built-in user passwords by running elasticsearch-setup-passwords, but it failed because the cluster was not healthy -
Unexpected response code [503] from calling PUT https://master-0-node.myelastic.autoip.dcos.thisdcos.directory:1027/_security/user/apm_system/_password?pretty
Cause: Cluster state has not been recovered yet, cannot write to the [null] index
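For reference, here is a minimal sketch (Python, standard library only) of how the cluster could be checked before retrying elasticsearch-setup-passwords. It calls the standard _cluster/health endpoint and reads its "status" and "timed_out" fields. The helper names (is_ready, fetch_health), the credentials, and the rule that "yellow" or "green" is good enough for the security index are assumptions for illustration, not something taken from the DC/OS package.

```python
import base64
import json
import ssl
import urllib.error
import urllib.request


def is_ready(health: dict) -> bool:
    """Return True if a _cluster/health body suggests the cluster accepts writes.

    Assumption: "yellow" or "green" (all primary shards assigned) is enough
    for the security index to be usable; "red" or a timed-out call is not.
    """
    status_ok = health.get("status") in ("yellow", "green")
    return status_ok and not health.get("timed_out", False)


def fetch_health(base_url: str, user: str, password: str,
                 verify_tls: bool = True) -> dict:
    """GET <base_url>/_cluster/health and return the parsed JSON body.

    `user` and `password` are hypothetical credentials that still work
    while the security index is unavailable (for example a file realm user).
    """
    request = urllib.request.Request(base_url.rstrip("/") + "/_cluster/health")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", "Basic " + token)

    context = ssl.create_default_context()
    if not verify_tls:
        # Only for test clusters with self-signed certificates.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE

    try:
        with urllib.request.urlopen(request, context=context, timeout=10) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        # An HTTP error (such as the 503 above) is reported as "not ready".
        return {"status": "unavailable", "http_code": err.code}
```

It would be called as is_ready(fetch_health("https://master-0-node.myelastic.autoip.dcos.thisdcos.directory:1027", "<user>", "<password>")), retrying elasticsearch-setup-passwords only once that returns True.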
We would appreciate any pointers on what is going wrong and what needs to be done. From the looks of it, this does not seem to be a DC/OS issue but rather a problem with how we are restarting the masters; we just can't figure out what it is.