Cluster TLS encryption, master nodes and certificate syntax specifics

security

(Ryan Downey) #1

Version 6.5.4
9 node ES cluster
3 node Kibana cluster

I recently had a problem implementing TLS within our cluster. After implementing TLS in our QA environment and taking a lot of notes along the way, I thought that deploying it to our prod environment would be fairly easy. I initially went through the steps that I had documented for our QA environment and created our CA/certs from that documentation, obviously changing node names etc. for the prod environment.

After deploying the certs and configuring the yaml files in our prod environment, I figured that things would connect fairly quickly. Unfortunately it didn't work out so easily. I initially started with just one ES node, abc.gov (xpack/TLS enabled), and the Kibana node 234.gov (xpack/TLS enabled) that it's paired with through the Kibana yaml file. This didn't work, as I kept getting a "Failed to authenticate user kibana" error message, which suggests that the kibana password isn't working. I turned on three of the ES nodes in the cluster (xpack/TLS enabled) to see if that would help. No luck. I turned on all of the ES nodes; no luck.

I did some troubleshooting, double-checking certs, yaml syntax, etc., with no luck. I then stumbled across an error in our Kibana Alerts section that said "Low - Not resolved - Configuring TLS will be required to apply a Gold or Platinum license when security is enabled." As we had just updated our license, I thought that maybe there was a license issue. There wasn't, in the end, but this sent me down the proverbial rabbit hole thinking that the license was messed up for some reason.

After troubleshooting the license issue and determining that this probably wasn't the problem, I went looking for other possible issues. I ran across two other potential problems. The first is that we don't necessarily have dedicated master nodes, although our three Kibana nodes are linked to abc5app.gov, def5app.gov and ghi5app.gov. Our system is set up with discovery.zen.minimum_master_nodes: 3, but everything else is on the default node.master settings, so nothing is specifically dedicated, if I'm understanding this correctly.

The second issue was that our discovery.zen.ping.unicast.hosts: entries were set up as ABC5APP, DEF5APP, GHI5APP, etc., yet when we created our certs we just put in abc5app, def5app, ghi5app, etc. So syntactically these are not the same.

After recreating the certs with all of our possible syntax variants and deploying them, I went back to trying to connect abc5app.gov with Kibana 234app.gov; no luck. I then turned on the three ES nodes that are linked to the Kibana nodes and turned on all three Kibana nodes; no luck. Only after turning all of the nodes in our cluster back on with the new certs (which had all the syntax combinations) did everything link up. So the questions after this multi-paragraph report are these:

  • Is my assumption correct that in order to implement TLS your cluster needs access to a master node? Not sure if I'm phrasing that correctly, but it seems like this could have been a potential issue for our TLS deployment. Could the lack of access to a master node have been causing the "Failed to authenticate" error?
  • How important is it that we have dedicated master nodes, rather than letting ES just pick one? Our environments are still new but will grow later, so it seems like establishing master nodes now will help us as we implement hot/warm architecture in the future.
  • As far as syntax goes for cert creation, how closely does the --dns field have to match the hosts listed in discovery.zen.ping.unicast.hosts, e.g. abc5app vs. ABC5APP?
  • Is there a way to test your certs via curl for server-to-server communication, so that you can tell whether they are the issue or not? I've tried things after researching online, to no avail.

(Tim Vernum) #2

I think I understand the general question you're asking, so hopefully my explanation below will cover what you need. Note that the notes below apply to 6.x versions of Elasticsearch; there are changes to cluster formation in 7.x, so these notes will not be accurate for Elasticsearch 7 or later.

When an Elasticsearch node starts, it attempts to join (or form) a cluster. To do that it must be able to connect to at least "minimum_master_nodes" nodes that are marked as "master eligible" (including itself).
In your case, since you have discovery.zen.minimum_master_nodes: 3 and all your nodes are master eligible, you need to have connectivity between 3 nodes in order to form a cluster. (Note: the setup you have is not advisable; see the explanation below.)

If a node cannot join/form a cluster, then it cannot do very much - in particular, it cannot/will not respond to any requests for data (even if it has a stale copy of that data).

Elasticsearch security supports a number of different authentication mechanisms, with a variety of configurations. Many of those mechanisms rely on data stored internally within Elasticsearch indices. For example, the current state of the elastic and kibana users (including their hashed passwords) is stored within an Elasticsearch index.

If you try to connect to a node that was not able to join a cluster, then that node does not have access to your security indices and will not be able to authenticate the kibana user, or any other user that depends on these indices.
For this reason we recommend that you have access to a superuser that does not depend on the security indices (e.g. a superuser configured in the file realm).
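As a sketch of that recommendation (the realm name and username below are placeholders, not settings from the thread): in 6.x a file realm can be declared in elasticsearch.yml, and a local superuser created with the bundled elasticsearch-users tool. Because the file realm is per-node, this has to be done on each node where you want the break-glass login to work.

```yaml
# elasticsearch.yml -- declare a file realm (the name "file1" is arbitrary);
# in 6.x the file realm is also enabled implicitly if no realms are configured
xpack.security.authc.realms.file1:
  type: file
  order: 0

# Then, on each node (the file realm is local to the node, not cluster-wide):
#   bin/elasticsearch-users useradd break_glass_admin -r superuser
```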

So, in your setup it is not possible to start a single node of Elasticsearch and then connect via Kibana because that node is not a cluster.

Dedicated master nodes have a specific meaning in Elasticsearch terminology: a node that is master eligible (node.master is true or unset) and does not have any other role; in particular, it does not store any data (node.data is false).

Dedicated master nodes are particularly useful in large clusters, where the overhead of managing the cluster state is relatively high and you want to ensure that the nodes doing this work are not also trying to respond to search/ingest requests, etc.

It doesn't sound like you want dedicated master nodes (though it depends very much on your workload), but rather that you simply want to reduce the number of master eligible nodes.

Your number 1 priority should be to ensure that your cluster is working safely, and right now it is not. Your cluster is at risk of "split brain", which can lead to data loss or corruption.

You have 9 nodes, and all of them are master eligible.
You have also set minimum_master_nodes to 3, which means any 3 nodes can form a cluster on their own.
So, in your case if a network partition happens (and they do), your 9 nodes could clump into 3 groups of 3 nodes each and every one of those groups would form their own cluster and happily continue on.
But now you don't have 1 single cluster with a consistent view of the data, you have 3 clusters with potentially different views of the data. AKA a big mess.
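The safe value follows from simple quorum arithmetic: with n master-eligible nodes, minimum_master_nodes must be at least floor(n/2) + 1, so that two disjoint groups of nodes can never both reach quorum. A quick sketch (shell integer division does the floor for us):

```shell
# majority(n) = floor(n/2) + 1
echo "9 master-eligible nodes -> quorum $(( 9 / 2 + 1 ))"   # prints quorum 5
echo "3 master-eligible nodes -> quorum $(( 3 / 2 + 1 ))"   # prints quorum 2
```

With minimum_master_nodes set to only 3 out of 9, two (or three) groups of 3+ nodes can each satisfy the threshold simultaneously, which is exactly the split-brain scenario above.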

So, to fix this you have 2 options.

  1. Change minimum_master_nodes to 5. It needs to be set to an absolute majority of the master eligible nodes in the cluster, which is 9/2 + 1 => 5. That will mean that you need to have any 5 nodes online and connecting to one another in order to form a cluster, and the other 4 will not be able to form their own separate cluster. This will work, but it's problematic because (a) you need to keep 5 nodes online all the time and (b) if you decide to expand your cluster to 10 nodes, then you need to make sure you change minimum_master_nodes to 6 before you add that extra node (or else you risk creating two split clusters of 5 nodes each);
    So, instead I would recommend:
  2. Change your cluster so only 3 of the nodes are master eligible. They can keep storing data if you wish (and for a 9 node cluster, you probably want them to), but you set node.master: true on those 3 nodes, and node.master: false on every other node. Then you switch minimum_master_nodes to 2, which is the absolute majority of 3 (3/2 + 1). This will mean that any 2 of those 3 nodes need to be online at a time to form a cluster, along with as many of the other nodes as you want/need.
    From a fault tolerance point of view, this setup will mean you can afford to lose one of those master eligible nodes and still maintain a cluster (and then you can mark one of the other nodes as master eligible to bring back that level of resilience). If you need stronger resilience guarantees than that, then you can have 5 master eligible nodes, and set minimum_master_nodes to 3.
    I would also recommend you turn off node.ingest and node.ml on those master eligible nodes, just to segregate them from some of the potential workload.
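As a sketch of option 2 (the node names are the ones mentioned in the thread; adjust to your own), the relevant elasticsearch.yml settings would look roughly like this:

```yaml
# --- On the 3 chosen master-eligible nodes (e.g. abc5app, def5app, ghi5app) ---
node.master: true
node.data: true       # in a 9-node cluster they can keep holding data
node.ingest: false    # optional: keep cluster-management nodes out of ingest
node.ml: false        # optional: and out of machine-learning workloads

# --- On the other 6 nodes ---
node.master: false
node.data: true

# --- On every node: absolute majority of 3 master-eligible nodes = 3/2 + 1 = 2 ---
discovery.zen.minimum_master_nodes: 2
```

A rolling restart is needed for the node role changes to take effect, so plan the order so that at least 2 of the 3 master-eligible nodes stay up at any time.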

DNS is not case sensitive, so this shouldn't be a problem. However, due to the way networking and discovery works in Elasticsearch, it is typically the case that you need the IP addresses in the certificates as well as the DNS names.
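For illustration (the CA file name and IP below are placeholders): the 6.x elasticsearch-certutil tool accepts comma-separated lists for both --dns and --ip, so one certificate can cover a node's name variants and its address along these lines:

```shell
bin/elasticsearch-certutil cert --ca elastic-stack-ca.p12 \
  --name abc5app --dns abc5app.gov,ABC5APP --ip 10.0.0.5
```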

No, curl cannot communicate with the transport port (9300); it speaks Elasticsearch's binary transport protocol, not HTTP.
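As a supplement (not from the original reply): while curl can't speak the transport protocol, openssl can still help. You can probe the TLS handshake on the transport port with openssl s_client, and you can inspect a certificate's SAN entries locally before deploying it. A self-contained sketch using a throwaway self-signed cert (the name and IP are placeholders; requires OpenSSL 1.1.1+ for -addext/-ext):

```shell
# Generate a throwaway self-signed cert carrying DNS and IP SAN entries,
# standing in for a real node certificate (name/IP are hypothetical):
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/node.key -out /tmp/node.crt -days 1 \
  -subj "/CN=abc5app.gov" \
  -addext "subjectAltName=DNS:abc5app.gov,IP:10.0.0.5"

# Print the SANs the cert actually contains -- run this against your real
# node certs to confirm the names match what discovery will use:
openssl x509 -in /tmp/node.crt -noout -ext subjectAltName

# Against a live node, the TLS handshake on the transport port can be
# probed with:  openssl s_client -connect abc5app.gov:9300 </dev/null
```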

Nowhere in your description of your issue can I see a place where you consulted the Elasticsearch logs. It may be that you just didn't mention it, but for the sorts of things that seem to have gone wrong (cluster not forming, insufficient master nodes, certificates being rejected) the log files contain a great deal of diagnostic information that can help pinpoint the problem quite quickly.

You ought to get in the habit of checking the logs of the Elasticsearch nodes as your first port of call when troubleshooting these sorts of issues.


(Ryan Downey) #3

Tim,

I appreciate you taking the time to respond; I know it took me a while to type up my post, so I'm sure it took some time for you to lay all of this out as well. I'll re-read this a few times and start implementing your suggestions in our environment so that we have a better experience and handle on things. Sorry about not mentioning the ES logs. I did look into them, and saw the master node problem, but I still wasn't able to get things up and running, which kind of sent me down the rabbit hole of searching/looking at things that weren't really the problem. I will definitely take a closer look at the logs on each portion of the cluster to aid my troubleshooting. I think I was a lot more focused on the Kibana logs than the ES logs, even though the ES ones are where I should have been focusing, so I understand your point on that front.

Again, appreciate you taking the time to explain all of my questions for me. Enjoy the rest of your day.