Node not discovering master node properly, "Cluster state has not been recovered yet, cannot write to the [null] index" error [503]

Hello, I have two servers in two different locations that I am trying to join into a single cluster. Both are running Elasticsearch v8.1.3.

I have the master server, which is configured like this:

cluster.name: yyz-news-prod
node.name: node-yyz-1
cluster.initial_master_nodes: ["node-yyz-1"]
discovery.seed_hosts:
   - xx.xx.xxx.x
   - yy.yyy.yyy.yy

xpack.security.enrollment.enabled: true
xpack.security.http.ssl:
  enabled: true
  keystore.path: certs/http.p12
xpack.security.transport.ssl:
  enabled: true
  verification_mode: certificate
  keystore.path: certs/transport.p12
  truststore.path: certs/transport.p12
http.host: [_local_, _site_]

And then the second server:

cluster.name: yyz-news-prod
node.name: node-yyz-2
cluster.initial_master_nodes: ["node-yyz-1"]
discovery.seed_hosts:
   - xx.xx.xxx.x
   - yy.yyy.yyy.yy

xpack.security.enrollment.enabled: true
xpack.security.http.ssl:
  enabled: true
  keystore.path: certs/http.p12
xpack.security.transport.ssl:
  enabled: true
  verification_mode: certificate
  keystore.path: certs/transport.p12
  truststore.path: certs/transport.p12
http.host: [_local_, _site_]

I started with a fresh install on both servers, launched the master server first, then launched the second (slave) server. However, when I run curl --insecure https://localhost:9200/_cluster/health?pretty on my master server, I see only 1 node connected:

{
  "cluster_name" : "yyz-news-prod",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 2,
  "active_shards" : 2,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

When I do the same on the slave server, using the autogenerated password for its elastic user, I get the following 503 error:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "status_exception",
        "reason" : "Cluster state has not been recovered yet, cannot write to the [null] index"
      }
    ],
    "type" : "authentication_processing_error",
    "reason" : "failed to promote the auto-configured elastic password hash",
    "caused_by" : {
      "type" : "status_exception",
      "reason" : "Cluster state has not been recovered yet, cannot write to the [null] index"
    }
  },
  "status" : 503
}

I don't know why my slave node cannot find and connect to my master node. Ports 9200 and 9300 are forwarded on both servers, so I don't think it's a networking issue. Any suggestions?

Can you take a look at the server-side logs? They should be more helpful than the error message returned to the client.
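
If the logs are noisy, a small filter can help. The following is only a sketch: it assumes the default deb/apt layout, where the main log sits under /var/log/elasticsearch and is named after cluster.name, and the function name es_warnings is made up for illustration.

```shell
# Print only WARN and ERROR entries from an Elasticsearch log file.
# The level field is padded to five characters, so WARN appears as "[WARN ]".
es_warnings() {
    grep -E '^\[[^]]+\]\[(WARN|ERROR) *\]' "$1"
}

# Usage (the path is an assumption based on this cluster's name):
#   es_warnings /var/log/elasticsearch/yyz-news-prod.log | tail -n 50
```

Running it on both nodes around the time the second node starts should show whether the connection attempt arrives and why it is rejected.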

Thanks!

Hello, thank you for the response and sorry for the late reply.

The log at /var/log/elasticsearch/my-cluster-name.log keeps repeating the following line until the client disconnects from the master server:

[2022-05-02T01:32:11,627][WARN ][o.e.x.c.s.t.n.SecurityNetty4Transport] [node-yyz-1] client did not trust this server's certificate, closing connection Netty4TcpChannel{localAddress=/10.0.0.207:9300, remoteAddress=/yy.yyy.yyy.yy:58074, profile=default}

I did not configure the certificates on either node. I just installed Elasticsearch via the apt package manager, and the security settings were auto-configured.

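Each fresh package install generates its own CA and certificates, so the two nodes do not trust each other's transport certificates. See: Reconfigure a node to join an existing cluster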
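
For a deb/apt install, the procedure on that page comes down to roughly the following. Treat it as a sketch rather than a verified runbook: it assumes the default package location /usr/share/elasticsearch, that node-yyz-2 holds no data you need to keep, and that Elasticsearch runs as a systemd service. The two function names are just illustrative wrappers around the real CLI tools.

```shell
# Assumption: default install location for the deb/apt package.
ES_HOME="${ES_HOME:-/usr/share/elasticsearch}"

# Run on node-yyz-1 (the node that already formed the cluster).
# Prints an enrollment token for a new node; the token is short-lived.
create_node_token() {
    "$ES_HOME/bin/elasticsearch-create-enrollment-token" -s node
}

# Run on node-yyz-2 while Elasticsearch is stopped.
# Replaces the auto-generated security configuration with one
# issued by the existing cluster, using the token from node-yyz-1.
reconfigure_node() {
    "$ES_HOME/bin/elasticsearch-reconfigure-node" --enrollment-token "$1"
}

# Usage (TOKEN is the string printed by create_node_token):
#   on node-yyz-1:  create_node_token
#   on node-yyz-2:  sudo systemctl stop elasticsearch
#                   reconfigure_node "$TOKEN"
#                   sudo systemctl start elasticsearch
```

Because node-yyz-2 has already been started once on its own, the reconfigure tool may refuse to run until that node is returned to a fresh state. I have not verified that on this setup.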