Problems Joining a Cluster

Hi Everyone,
I think I posted my first post in the wrong Elasticsearch category. I'm trying to get my new cluster up and working and I'm really struggling with the order of operations. I have tried a whole bunch of things but nothing works.
I have a master node that is online and waiting for requests to join/enroll.
The master node successfully generates enrollment tokens. All my nodes can contact each other and are on the same L2 network.
The process of enrolling a new node according to the documentation is just:

  1. Generate an enrollment token on the master node.
  2. From the installation directory of your new node, start Elasticsearch.
  3. Pass the enrollment token with the --enrollment-token parameter.

Following those instructions I get the following error:
ERROR: Skipping security auto configuration because it appears that the node is not starting up for the first time. The node might already be part of a cluster and this auto setup utility is designed to configure Security for new clusters only., with exit code 80

So this doesn't work. It makes sense that it doesn't work, though.
Why?

  1. The install itself says to start ES with systemctl, not from the install folder.
  2. The default elasticsearch.yml file has the wrong info in it, and network.host is not set, among other important settings.
  3. Setting the node name, cluster name, and network.host and then restarting ES produces the exact same error, with exit code 80.

I have tried soooo many things.

  1. I have blanked out the security autoconfig section, then added the
    xpack.security.autoconfiguration.enabled: true
    option.
  2. I removed all the security config and tried starting with no security so I could enroll. The service doesn't start without security.
  3. I tried renaming the certs created at the first startup so that ES doesn't find them and lets the master node populate them instead upon joining. Didn't work.

I need help identifying a simple order of operations for installing the software from dnf, setting the service to start, and then enrolling a new node into the existing cluster. I do not see how to do this without modifying elasticsearch.yml with the correct cluster name.
I appreciate any suggested steps to try.
Thanks.

I am continuing to try and retry.
I started over from scratch and still get the same error. Here is the pertinent section from /var/log/elasticsearch/hostname.log:

[2024-01-26T17:18:54,608][WARN ][o.e.c.s.DiagnosticTrustManager] [host4] failed to establish trust with server at []; the server provided a certificate with subject name [CN=host1], fingerprint [a03da0280f4881b329101fb769737a0cc2fc8e28], no keyUsage and no extendedKeyUsage; the certificate is valid between [2024-01-23T20:38:16Z] and [2122-12-30T20:38:16Z] (current time is [2024-01-27T00:18:54.607998584Z], certificate dates are valid); the session uses cipher suite [TLS_AES_256_GCM_SHA384] and protocol [TLSv1.3]; the certificate does not have any subject alternative names; the certificate is issued by [CN=Elasticsearch security auto-configuration HTTP CA]; the certificate is signed by (subject [CN=Elasticsearch security auto-configuration HTTP CA] fingerprint [7dad929cbd66b990a3ff95a6496cb82e59a03a49]) which is self-issued; the [CN=Elasticsearch security auto-configuration HTTP CA] certificate is not trusted in this ssl context ([xpack.security.transport.ssl (with trust configuration: StoreTrustConfig{path=certs/transport.p12, password=, type=PKCS12, algorithm=PKIX})]); this ssl context does trust a certificate with subject [CN=Elasticsearch security auto-configuration HTTP CA] but the trusted certificate has fingerprint [288543c4bbcdb394327fcdd9b56029cc874fdcae]

I removed my hostname and just put "host". host1 is the first master node and I'm trying to add host4. The reason it is not host2 is that host4 is in a different rack, so there is physical separation for resiliency purposes.
Thanks,
Bryan

In case you are wondering about the master discovery sections of my elasticsearch.yml file, here they are:
discovery.seed_hosts: ["host1", "host4"]
xpack.security.enabled: true
xpack.security.enrollment.enabled: true
xpack.security.http.ssl:
  enabled: true
  keystore.path: certs/http.p12
xpack.security.transport.ssl:
  enabled: true
  verification_mode: certificate
  keystore.path: certs/transport.p12
  truststore.path: certs/transport.p12
cluster.initial_master_nodes: ["host1:9300", "host4:9300"]
http.host: 0.0.0.0
transport.host: 0.0.0.0

My cluster.name is set to exactly the same value on this host as on host1.
Telnet to the hosts on ports 9300 and 9200 works just fine. All nodes are reachable.

Looks like you have two different CA certificates on the two different nodes. All the transport certs need to be issued by the same CA.
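The log line above carries three fingerprints: the certificate the server presented, the CA that signed it, and the CA this node actually trusts. The last two differ, which is the mismatch. A minimal sketch for pulling them out of a log, assuming a POSIX shell with grep and sed; the function name `extract_fingerprints` is made up for illustration:

```shell
# Hypothetical helper: print each certificate fingerprint found on stdin,
# one per line, in the order it appears in the log message.
extract_fingerprints() {
  grep -oE 'fingerprint \[[0-9a-f]+\]' |
    sed -E 's/^fingerprint \[([0-9a-f]+)\]$/\1/'
}

# Example (the log file name is a placeholder):
# grep DiagnosticTrustManager /var/log/elasticsearch/<cluster>.log | extract_fingerprints
```

If the second and third values differ, the two nodes were set up with different CAs.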

Hi David,
Thank you for responding!
I just followed the install guide. I can see in the log that there is an issue with certs, but the process of creating a new cluster and adding a node should be straightforward. Do you know what cmds I need to run to allow the new node to just use the cert info from the cluster it is trying to join?
The install steps I posted above say to start Elasticsearch and then run the enrollment cmd with the token from the cluster being joined. When you "start" Elasticsearch, it does all the encryption setup; you don't get a choice unless you modify elasticsearch.yml, but if you do that, there is no encryption and it won't join the cluster. I tried this over and over with different settings each time.
I look forward to your reply or others that know what I should do.
Thanks,
Bryan

I think you're following these docs? But they say that you must pass the --enrollment-token option the very first time you run Elasticsearch. Don't start it up and then try and enroll it into the cluster, that won't work because it'll already have formed its own cluster.

Thank you for responding so quickly.
So I thought so too...however, when I ran it right away after install, using this:
$ cd /usr/share/elasticsearch
$ bin/elasticsearch --enrollment-token
ES failed to run saying ES wasn't running. I will try again, moving any files that might have been used so that it tries to start it without any of the old configs or cert files.
A couple of problems I foresee with doing this start right after install:

  1. You should not and cannot run this as root, but the ES account is created at install time by the install scripts, yet /usr/share/elasticsearch is owned by root and the group is elasticsearch.
  2. You cannot su to elasticsearch because the account creation makes it nologin and no shell.
  3. If I do sudo, then it tries to use my userid instead or says there is an error with the folder being owned by root.
  4. I did add my userid to the elasticsearch group before trying but I still got the elasticsearch is not running error.

I'll let you know what it does when I try again.
Bryan

Hard to know how to proceed here without seeing the specific messages that Elasticsearch emits.

The auto-enrollment feature is designed to make it easier to set up a cluster but to do that it makes some assumptions about the environment in which it's running that seem not to hold in your case. It might be simplest to just skip it and set up security manually instead rather than trying to investigate why it's not working.

Hi David,
Thanks for the link. Of the dozens of documentation pages I have been reading, I had not used that one yet. Just so you know, regarding joining an existing cluster, I tried following the documentation and it doesn't work. I built brand new nodes and, without starting any elasticsearch service, went to /usr/share/elasticsearch and ran bin/elasticsearch --enrollment-token, and I get the same error:
ERROR: Skipping security auto configuration because it appears that the node is not starting up for the first time. The node might already be part of a cluster and this auto setup utility is designed to configure Security for new clusters only., with exit code 80
I will try again using my first node as a CA. The instructions say to run the ./bin/elasticsearch-certutil ca cmd before starting ES, but my cluster is already active so I will stop the service, run the CA cmd and the rest of the instructions and then restart ES. Hopefully it doesn't break.
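The manual procedure from that guide can be sketched roughly as below. This is only an outline under assumptions: the RPM install path `/usr/share/elasticsearch/bin`, the tools' default output file names (`elastic-stack-ca.p12`, `elastic-certificates.p12`), and illustrative helper names (`run`, `make_transport_certs`, `store_transport_passwords`). With `DRY_RUN=1` the commands are printed rather than executed.

```shell
# Print the command when DRY_RUN is set, otherwise execute it.
run() { if [ -n "${DRY_RUN:-}" ]; then echo "+ $*"; else "$@"; fi; }

ES_BIN=/usr/share/elasticsearch/bin   # assumed RPM location

# On ONE node only: create a CA, then a node certificate signed by it.
# Both tools prompt for output file names and passwords.
make_transport_certs() {
  run sudo "$ES_BIN/elasticsearch-certutil" ca
  run sudo "$ES_BIN/elasticsearch-certutil" cert --ca elastic-stack-ca.p12
}

# On EVERY node, after copying the same elastic-certificates.p12 into the
# config directory: store the keystore and truststore passwords.
store_transport_passwords() {
  run sudo "$ES_BIN/elasticsearch-keystore" add xpack.security.transport.ssl.keystore.secure_password
  run sudo "$ES_BIN/elasticsearch-keystore" add xpack.security.transport.ssl.truststore.secure_password
}
```

The point of the split is that the CA is generated once, and every node, including the first one, ends up using a certificate issued by that same CA.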
I'll report back if it doesn't go well. I have high hopes.
I don't know who can receive this request, but I really think that the page "Start the Elastic Stack with security enabled automatically | Elasticsearch Guide [8.12] | Elastic" needs to be updated. I don't see how those steps (generate token, run ES the first time with the enrollment option) could work, as they should always produce the error I sent above, since the TLS setup on the new remote node is not going to be the same as on the cluster.
Thanks,
Bryan

This error indicates that something is going awry in how you're installing ES. Either ES is running at least once, or the installation process is putting files in the data or config paths in a way that makes it look like it's not a fresh install.

That sounds true.
Here is all I am doing:
$ sudo dnf install -y --enablerepo=elasticsearch elasticsearch
Elasticsearch repository for 8.x packages 9.8 MB/s | 50 MB 00:05
....
lots of lines
...
Downloading Packages:
elasticsearch-8.12.0-x86_64.rpm 542 kB/s | 592 MB 18:39
...

Installed:
elasticsearch-8.12.0-1.x86_64

Complete!

Then I do the enrollment. I tried as my user and as root, and I get the response I sent above: "ERROR: Skipping auto configuration...."

Interestingly, on a brand new node, directly after install, prior to ever being run, the /etc/elasticsearch/certs folder is populated as is the /etc/elasticsearch/elasticsearch.keystore file.

It's confusing that those files have certs/data in them at install time. I would think those should not populate until ES is run or the enrollment cmd populates them with data/certs from the cluster nodes. What do you think?

Thanks,
Bryan

Hmm yeah that sounds surprising to me and would explain this I think. I'm not too familiar with the RPM installer, maybe it's trying to be too clever or something? I mostly use the .tar.gz one (perhaps more because I'm running it for testing purposes rather than to install it system-wide).
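A quick way to check a node for the leftovers described above before attempting enrollment is sketched here. The default paths are assumptions based on the RPM layout mentioned in this thread (`/etc/elasticsearch` for config, `/var/lib/elasticsearch` for data; pass your own `path.data` if it differs), and `check_fresh_node` is a made-up name.

```shell
# Hypothetical helper: report files that make a node look "not new".
# Returns 0 when nothing is found, 1 otherwise.
check_fresh_node() {
  conf_dir="${1:-/etc/elasticsearch}"
  data_dir="${2:-/var/lib/elasticsearch}"
  found=0
  # Certificates already generated in the config directory?
  if [ -d "$conf_dir/certs" ] && [ -n "$(ls -A "$conf_dir/certs" 2>/dev/null)" ]; then
    echo "certs present: $conf_dir/certs"
    found=1
  fi
  # Anything written to the data path?
  if [ -d "$data_dir" ] && [ -n "$(ls -A "$data_dir" 2>/dev/null)" ]; then
    echo "data present: $data_dir"
    found=1
  fi
  [ "$found" -eq 0 ] && echo "no leftovers found"
  return "$found"
}

# Example: sudo sh -c '. ./check.sh; check_fresh_node /etc/elasticsearch /elastic'
```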

On the second node, did you uninstall Elasticsearch before installing again? Even if you download and install it again, the data is still there. You need to remove it first, or you can use the reconfigure option on the second node to join the first node.

sudo /usr/share/elasticsearch/bin/elasticsearch-reconfigure-node --enrollment-token eyJ2ZXIiOiI4LjAuMCIsImFkciI6WyIxOTIuMTY4LjEuMTY6OTIwMCJdLCJmZ3IiOiI4NGVhYzkyMzAyMWQ1MjcyMmQxNTFhMTQwZmM2ODI5NmE5OWNiNmU0OGVhZjYwYWMxYzljM2I3ZDJjOTg2YTk3Iiwia2V5IjoiUy0yUjFINEJrNlFTMkNEY1dVV1g6QS0wSmJxM3hTRy1haWxoQTdPWVduZyJ9
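Put together, the reconfigure route might look like the sketch below. It assumes an RPM install whose service has never been started, and a token created on an existing node with `elasticsearch-create-enrollment-token -s node`. The names `run` and `enroll_rpm_node` are illustrative, and `DRY_RUN=1` only prints the commands.

```shell
# Print the command when DRY_RUN is set, otherwise execute it.
run() { if [ -n "${DRY_RUN:-}" ]; then echo "+ $*"; else "$@"; fi; }

# Sketch: install, reconfigure with the token, and only then start the service.
enroll_rpm_node() {
  token="$1"   # enrollment token generated on a node already in the cluster
  run sudo dnf install -y --enablerepo=elasticsearch elasticsearch
  # Must happen before the first start; the tool asks for confirmation.
  run sudo /usr/share/elasticsearch/bin/elasticsearch-reconfigure-node --enrollment-token "$token"
  run sudo systemctl daemon-reload
  run sudo systemctl enable --now elasticsearch.service
}

# Example: DRY_RUN=1 enroll_rpm_node "<token>"
```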

Hi Redsearch,
Thanks for jumping in here. I need all the help I can get. Neither the reconfigure option nor the elasticsearch cmd alone with the enrollment token enrolls the node into the cluster.
As an update, I followed all the steps on the page David linked above, "Set up basic security for the Elastic Stack", and restarted my cluster on each node. The service starts now and shows active-green on the new node! But the health status is bad on the other node and is only healthy on the first node. When I query with:
$ curl -k -u elastic "https://10.2.1.46:9200/_cluster/health?&pretty"
Enter host password for user 'elastic':
{
  "cluster_name" : "clustername",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 1,
  "active_shards" : 1,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

So I don't have the other node in the cluster.
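To make that check scriptable, the `number_of_nodes` field can be pulled out of the health response. A minimal sketch, assuming the pretty-printed JSON shown above and a POSIX sed; `node_count` is an illustrative name:

```shell
# Hypothetical helper: read a _cluster/health response on stdin and
# print the value of "number_of_nodes".
node_count() {
  sed -nE 's/.*"number_of_nodes"[[:space:]]*:[[:space:]]*([0-9]+).*/\1/p'
}

# Example (prompts for the elastic password):
# curl -sk -u elastic "https://10.2.1.46:9200/_cluster/health?pretty" | node_count
```

A result of 1 on a three-node setup means the other nodes have not joined.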
When I put in the other node's IP, it can't authenticate:
$ curl -k -u elastic "https://10.2.1.47:9200/_cluster/health?&pretty"
Enter host password for user 'elastic':
{
  "error" : {
    "root_cause" : [
      {
        "type" : "security_exception",
        "reason" : "unable to authenticate user [elastic] for REST request [/_cluster/health?&pretty]",

I tried to run the command in the documentation to update the elastic superuser on the other nodes:
sudo /usr/share/elasticsearch/bin/elasticsearch-reset-password -u elastic
ERROR: Failed to determine the health of the cluster. Unexpected http status [503], with exit code 65

You should check the log of the broken node; it should be in /var/log/elasticsearch/.

Yup, you were right. That log does indicate failure to establish trust:

[WARN ][o.e.c.s.DiagnosticTrustManager] [host2] failed to establish trust with server at [<unknown host>]; the server provided a certificate with subject name [CN=host1], fingerprint [a03da0280f4881b329101fb769737a0cc2fc8e28], no keyUsage and no extendedKeyUsage; the certificate is valid between [2024-01-23T20:38:16Z] and [2122-12-30T20:38:16Z] (current time is [2024-01-29T21:08:49.007059365Z], certificate dates are valid); the session uses cipher suite [TLS_AES_256_GCM_SHA384] and protocol [TLSv1.3]; the certificate does not have any subject alternative names; the certificate is issued by [CN=Elasticsearch security auto-configuration HTTP CA]; the certificate is signed by (subject [CN=Elasticsearch security auto-configuration HTTP CA] fingerprint [7dad929cbd66b990a3ff95a6496cb82e59a03a49]) which is self-issued; the [CN=Elasticsearch security auto-configuration HTTP CA] certificate is not trusted in this ssl context ([xpack.security.transport.ssl (with trust configuration: StoreTrustConfig{path=elastic-certificates.p12, password=<non-empty>, type=PKCS12, algorithm=PKIX})])

I followed the steps exactly and got no errors when transferring the cert and moving it into the correct location. I ensured the ownership/group was the elasticsearch user and the permissions were correct.

I assume that if I configure SSL manually as in the guide, using the enrollment token is pointless, correct?
I don't know what else to try to get these nodes talking.
Here is my elasticsearch.yml file with hostnames scrubbed:

cluster.name: clustername
node.name: host2
node.attr.rack: C2B4
path.data: /elastic
path.logs: /var/log/elasticsearch
network.host: 10.2.1.47
cluster.initial_master_nodes: ["host1", "host2", "host4"]

xpack.security.enabled: true
xpack.security.enrollment.enabled: true
xpack.security.http.ssl:
  enabled: true
  keystore.path: certs/http.p12
xpack.security.transport.ssl:
  enabled: true
  verification_mode: certificate
  keystore.path: elastic-certificates.p12
  truststore.path: elastic-certificates.p12
xpack.security.transport.ssl.client_authentication: required
discovery.seed_hosts: ["10.2.1.46:9300", "10.2.1.47:9300", "10.2.1.49:9300"]
http.host: 0.0.0.0
transport.host: 0.0.0.0

Thanks for your continued help.

OK. I no longer seem to get any SSL or cert errors. The root cause of those seemed to be a YAML config difference between my first master node and the other nodes.
However, I still cannot get the nodes to join. I have 3 master nodes in total; I decided to bring the 3rd one online to help keep me honest about my configs.
I do not know how to get the 2nd and 3rd master nodes to join node1's cluster.
/var/log/elasticsearch/host.log:

[2024-01-29T17:46:59,423][INFO ][o.e.c.c.ClusterBootstrapService] [host2] this node has not joined a bootstrapped cluster yet; [cluster.initial_master_nodes] is set to [host1]
[2024-01-29T17:46:59,428][INFO ][o.e.c.c.ClusterBootstrapService] [host2] skipping cluster bootstrapping as local node does not match bootstrap requirements: [host1]
[2024-01-29T17:47:09,433][WARN ][o.e.c.c.ClusterFormationFailureHelper] [ares2] master not discovered yet, this node has not previously joined a bootstrapped cluster, and this node must discover master-eligible nodes [host1] to bootstrap a cluster: have discovered [{host2}{zd9hTcEpSXSxwmCP0dm4ew}{RhIedo5cRSy3uCJ1C2NSTw}{ares2}{10.2.1.47}{10.2.1.47:9300}{cdfhilmrstw}{8.12.0}{7000099-8500008}]; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, [::1]:9300, [::1]:9301, [::1]:9302, [::1]:9303, [::1]:9304, [::1]:9305] from hosts providers and [{host2}{zd9hTcEpSXSxwmCP0dm4ew}{RhIedo5cRSy3uCJ1C2NSTw}{host2}{10.2.1.47}{10.2.1.47:9300}{cdfhilmrstw}{8.12.0}{7000099-8500008}] from last-known cluster state; node term 0, last-accepted version 0 in term 0; for troubleshooting guidance, see https://www.elastic.co/guide/en/elasticsearch/reference/8.12/discovery-troubleshooting.html

I checked and my cluster is setting discovery-type automatically to multi-node, so that is good.
I tried removing all the files from my data folder (/elastic) and restarting ES. It started and just recreated the files, so that doesn't seem to help.
And why is ES trying all these ports 9300-9305 instead of just 9300?
Both nodes produce the error above, so it's definitely the same problem.
I've been messing with cluster.initial_master_nodes and discovery.seed_hosts and don't know if that is where the problem is; so far no combination works. I have tried with both, with neither, and with just one or the other.
I noticed that on one of the installations ES added the ":9300" port which was interesting. ES didn't do that on every install. So I tried with and without:

cluster.initial_master_nodes:  ["host1:9300", "host2:9300", "host4:9300"]
cluster.initial_master_nodes:  ["host1", "host2", "host4"]
discovery.seed_hosts: ["host1", "host2", "host4"]
discovery.seed_hosts: ["host1:9300", "host2:9300", "host4:9300"]

I also tried with just host1 and not host2 and host4. The log entry above shows an attempt with just host1 in cluster.initial_master_nodes.
Is this where I should be focusing, or somewhere else?
Thanks,

I think you should start over again. It looks like you have 3 separate clusters, and they will not join another cluster. Start with the 1st node and make sure ports 9200 and 9300 are open on it, then open ports 9200 and 9300 on the firewall on node 2. Try to join the first cluster. I think you are getting confused about the ports: port 9300 is used for transport and 9200 is for HTTP. You should uninstall Elasticsearch on each server before starting over.

It doesn't, and saying incorrect stuff like this is really unhelpful. This log message says this node positively hasn't joined a cluster yet:

master not discovered yet, this node has not previously joined a bootstrapped cluster, and this node must discover master-eligible nodes [host1] to bootstrap a cluster: have discovered [{host2}{...}]; discovery will continue using [127.0.0.1:9300, {...}] from hosts providers and [{host2}{...}] from last-known cluster state; node term 0, last-accepted version 0 in term 0; for troubleshooting guidance, see https://www.elastic.co/guide/en/elasticsearch/reference/8.12/discovery-troubleshooting.html

It also says that cluster.initial_master_nodes is set (it shouldn't be) and discovery.seed_hosts is not set (it should be). See these docs for more details.
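A rough way to check a joining node's elasticsearch.yml for those two conditions is sketched below. It is a plain-text check that assumes the flat dotted key form used in this thread (not nested YAML) and the RPM default config path; `check_discovery_settings` is a made-up name. The loopback addresses on ports 9300 to 9305 in the log are the default discovery targets when no seed hosts are configured.

```shell
# Hypothetical helper: flag discovery settings that keep a node from
# joining an existing cluster. Returns 0 when both checks pass.
check_discovery_settings() {
  yml="${1:-/etc/elasticsearch/elasticsearch.yml}"
  status=0
  # Commented-out lines start with '#', so they do not match.
  if grep -Eq '^[[:space:]]*cluster\.initial_master_nodes[[:space:]]*:' "$yml"; then
    echo "cluster.initial_master_nodes is set; remove it on a node joining an existing cluster"
    status=1
  fi
  if ! grep -Eq '^[[:space:]]*discovery\.seed_hosts[[:space:]]*:' "$yml"; then
    echo "discovery.seed_hosts is not set; list the existing master-eligible nodes"
    status=1
  fi
  return "$status"
}

# Example: check_discovery_settings /etc/elasticsearch/elasticsearch.yml
```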

Thank you David! I was really discouraged thinking about having to start over again. I feel like I'm so close. I will try your suggestions and report back.
Thanks,
Bryan
