Elasticsearch discovery issues on Kubernetes

Deploying Elasticsearch 7.17.4 into Kubernetes (Rancher), and I've noticed that Elasticsearch doesn't form a cluster: it partitions into 3 separate master nodes.

How do I increase the level of debugging on the masters, as all the docs I've found seem to be wrong?

Secondly, how can I verify that my bare-metal cluster isn't the issue, given that if I deploy the same Helm chart in EKS, everything works like a charm?

Is there anything special I'd have to do for a bare-metal cluster?

By default Elasticsearch will emit enough logs to diagnose the problem; there's no need to look for debug logs.

If you need help understanding the logs, share them here. You'll need logs covering at least 5 mins from all nodes.

So there's an issue: the logs only show each node joining a cluster consisting essentially of itself. All nodes behave like this; it seems that they cannot resolve the other nodes' names and fall back to single-node discovery.

Yet when I use gethostbyname to verify that resolution works, it appears to be fine. Explicitly setting the discovery mode makes no difference.
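For reference, this is roughly the check I ran, as a minimal Python sketch. localhost is a stand-in here; against the real cluster I'd query the headless service name instead:

```python
import socket

host = "localhost"  # stand-in for the headless service name

# gethostbyname returns a single IPv4 address, so a successful call only
# proves that one record resolves...
print(socket.gethostbyname(host))

# ...whereas a full getaddrinfo lookup returns every published address,
# which is what actually matters for discovering all three masters.
addresses = sorted({info[4][0] for info in
                    socket.getaddrinfo(host, 9300,
                                       proto=socket.IPPROTO_TCP)})
print(addresses)
```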

Is there a difference between the way the RPM and the standalone non-RPM binary work? My image uses the RPM instead of the compressed tar archive. I'm deploying 7.17.4.

It'd be useful if you shared the logs and your config; otherwise we're really just making educated guesses.

As Mark says, we can only guess at the problem from a vague description of what you're seeing that you think to be relevant.

One possible guess is this situation. If that describes what you're seeing, then you can use the remedy in the docs:

If you intended to form a new multi-node cluster but instead bootstrapped a collection of single-node clusters...
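A quick way to check whether that's what happened: each separately bootstrapped cluster gets its own cluster UUID, which any node reports on a plain GET / against port 9200. Here's a minimal Python sketch; the pod URLs in the comment are placeholders, and the comparison helper is just illustrative:

```python
import json
import urllib.request

def fetch_cluster_uuid(base_url):
    """GET / on a node returns a small JSON body including cluster_uuid."""
    with urllib.request.urlopen(base_url) as resp:
        return json.load(resp)["cluster_uuid"]

def one_cluster(uuids):
    """True if every node reports the same cluster UUID."""
    return len(set(uuids)) == 1

# Placeholder URLs -- substitute your pods' actual addresses:
# uuids = [fetch_cluster_uuid(f"http://elasticsearch-master-{i}:9200/")
#          for i in range(3)]

# If you bootstrapped three single-node clusters you'll see three
# distinct UUIDs, and the check will be False.
print(one_cluster(["u1", "u1", "u1"]))  # True: a single cluster formed
print(one_cluster(["u1", "u2", "u3"]))  # False: separate clusters
```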

Unfortunately, due to company rules I can't send you the data you require. What I can tell you is that the deployment is a StatefulSet in k8s, and the Kubernetes cluster is bare metal running Rancher.

So my question is: given this isn't on a cloud provider, how does the discovery seeding work, as there is no API (as far as I'm aware) to leverage? I'm using the Elasticsearch Helm chart for 7.17.x.

Below is a section of the rendered chart related to discovery.

env:
  - name: node.name
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: cluster.initial_master_nodes
    value: "elasticsearch-master-0,elasticsearch-master-1,elasticsearch-master-2,"
  - name: discovery.seed_hosts
    value: "elasticsearch-master-headless"
  - name: cluster.name
    value: "elasticsearch"
  - name: network.host
    value: "0.0.0.0"
  - name: cluster.deprecation_indexing.enabled
    value: "false"

It looks like you set discovery.seed_hosts: elasticsearch-master-headless, which means Elasticsearch will do a lookup for this name and use all the addresses in the response for discovery.

The behaviour in an AWS EKS deployment is somewhat different, in that the cluster is formed properly. Presumably in the case of AWS it takes advantage of an API; in the case of a bare-metal Rancher cluster, I'm guessing this is going to behave in a slightly different way?

Is the discovery config still valid?

No, not unless you explicitly tell it to (e.g. install the discovery-ec2 plugin and set discovery.seed_providers: ec2, see these docs).
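To make the contrast concrete, these are the two styles of seeding (a sketch, not a drop-in config):

```yaml
# Cloud-API-based seeding: only happens if you opt in, e.g. with the
# discovery-ec2 plugin installed:
discovery.seed_providers: ec2

# DNS-based seeding, which is what your chart configures -- no cloud API
# involved, on EKS or anywhere else:
discovery.seed_hosts: elasticsearch-master-headless
```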

Ah OK, so in the case of bare-metal clusters how does this work? I presume we rely on pod-level resolution?

Elasticsearch doesn't do anything different, it does a lookup for the name(s) you configure and uses all the addresses in the response. The specific library function it calls for this lookup is getaddrinfo() which can be configured to behave differently in different environments (usually it uses /etc/hosts and DNS but many other options are available). You'll need to ask a local expert for the details of how name lookup works in your environment, sorry, that's not something I can help with here as it's not really anything to do with Elasticsearch.

I guess the follow-up question is: given my cluster is bare metal, should I be configuring my chart differently in terms of discovery?

Secondly, what else can I do from an Elasticsearch perspective to further debug this? The logs only show that a single-node cluster has been formed; they make no mention of any other nodes. So do I have to conclude that I need to treat a bare-metal deployment in a different way?

Well, given it's a k8s cluster, that would be CoreDNS. Is there any other debug available to me which I can turn on to get a better idea of what's going on? The logs are OK-ish but definitely light on the discovery process.

I think the docs I linked earlier say what to do here.

Yes, that clearly works in the context of a cluster built in a non-Kubernetes environment. I'm specifically talking about a Kubernetes environment with ephemeral nodes. I presume, from the short replies, that Elasticsearch on Kubernetes, let alone bare-metal k8s, isn't something that a lot of people know a great deal about?

I don't really understand what you're asking here. The docs I linked apply to all operating environments. If you're having trouble getting Kubernetes and ES to play nicely by hand, then maybe you'd be better off using something like Elastic Cloud on Kubernetes (ECK)?

And that's the problem: it's a very simple question. Given I have ephemeral containers in k8s, how can I debug a discovery problem with the instructions from your docs, which are clearly aimed at a non-k8s deployment?

Second question: how can I increase the debug levels in the logs, such that I can see what happens during discovery?

Unfortunately the cloud option is not a viable option for me.

I don't think you have a discovery problem, because ...

... there is no need to adjust any logging levels to diagnose discovery problems in 7.17. If you were having a discovery problem, the logs would already be full of debugging information about it.

Instead, I think you're having a cluster bootstrapping problem, and the docs I linked above tell you how to both diagnose and fix it. These docs apply to all environments, there's nothing about them which aims at any particular setup.

ECK is something you can run on your own local K8s environment, effectively a private cloud.
