Elasticsearch nodes are visible to each other in GCE, but won't form a cluster

I'm trying to set up a simple cluster using two GCE VMs (test-elk1 and test-elk2), with Elasticsearch running inside Docker containers.

I've set up firewall rules so the instances are fully reachable from each other; both of the following, run from test-elk2, work just fine:

    curl http://<instance_internal_IP>:9200
    curl http://test-elk1:9200
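For reference, the GCE firewall rules can be created with something like the following sketch (the rule name and source range are placeholders, and it assumes both VMs sit on the default VPC network):

```shell
# Allow Elasticsearch HTTP (9200) and transport (9300) traffic between
# the instances; adjust --source-ranges to your subnet's actual CIDR.
gcloud compute firewall-rules create allow-es-internal \
  --direction=INGRESS \
  --allow=tcp:9200,tcp:9300 \
  --source-ranges=10.128.0.0/9
```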

The problem is that they're not forming a cluster, and the logs don't make the reason clear:

    root@test-elk2:/home/yago/elk/es-data-node# docker-compose logs -f | grep '"level": "DEBUG"'
    es-data_1  | {"type": "server", "timestamp": "2020-07-10T22:14:25,277Z", "level": "DEBUG", "component": "o.e.d.z.ElectMasterService", "cluster.name": "test-elk", "node.name": "test-elk2", "message": "using minimum_master_nodes [-1]" }
    es-data_1  | {"type": "server", "timestamp": "2020-07-10T22:14:32,431Z", "level": "DEBUG", "component": "o.e.a.ActionModule", "cluster.name": "test-elk", "node.name": "test-elk2", "message": "Using REST wrapper from plugin org.elasticsearch.xpack.security.Security" }
    es-data_1  | {"type": "server", "timestamp": "2020-07-10T22:14:33,151Z", "level": "DEBUG", "component": "o.e.d.SettingsBasedSeedHostsProvider", "cluster.name": "test-elk", "node.name": "test-elk2", "message": "using initial hosts [test-elk1]" }
    es-data_1  | {"type": "server", "timestamp": "2020-07-10T22:14:35,737Z", "level": "DEBUG", "component": "o.e.t.n.Netty4Transport", "cluster.name": "test-elk", "node.name": "test-elk2", "message": "using profile[default], worker_count[2], port[9300-9400], bind_host[[0.0.0.0]], publish_host[[]], receive_predictor[64kb->64kb]" }
    es-data_1  | {"type": "server", "timestamp": "2020-07-10T22:14:35,758Z", "level": "DEBUG", "component": "o.e.t.TcpTransport", "cluster.name": "test-elk", "node.name": "test-elk2", "message": "binding server bootstrap to: [0.0.0.0]" }
    es-data_1  | {"type": "server", "timestamp": "2020-07-10T22:14:35,935Z", "level": "DEBUG", "component": "o.e.t.TcpTransport", "cluster.name": "test-elk", "node.name": "test-elk2", "message": "Bound profile [default] to address {0.0.0.0:9300}" }
    es-data_1  | {"type": "server", "timestamp": "2020-07-10T22:14:36,000Z", "level": "DEBUG", "component": "o.e.d.SeedHostsResolver", "cluster.name": "test-elk", "node.name": "test-elk2", "message": "using max_concurrent_resolvers [10], resolver timeout [5s]" }
    es-data_1  | {"type": "server", "timestamp": "2020-07-10T22:15:06,618Z", "level": "DEBUG", "component": "o.e.d.PeerFinder", "cluster.name": "test-elk", "node.name": "test-elk2", "message": "Peer{transportAddress=<test-elk1 internal IP>:9300, discoveryNode=null, peersRequestInFlight=false} connection failed", 
    es-data_1  | {"type": "server", "timestamp": "2020-07-10T22:15:37,260Z", "level": "DEBUG", "component": "o.e.d.PeerFinder", "cluster.name": "test-elk", "node.name": "test-elk2", "message": "Peer{transportAddress=<test-elk1 internal IP>:9300, discoveryNode=null, peersRequestInFlight=false} connection failed", 
    es-data_1  | {"type": "server", "timestamp": "2020-07-10T22:16:07,360Z", "level": "DEBUG", "component": "o.e.d.PeerFinder", "cluster.name": "test-elk", "node.name": "test-elk2", "message": "Peer{transportAddress=<test-elk1 internal IP>:9300, discoveryNode=null, peersRequestInFlight=false} connection failed", 
    es-data_1  | {"type": "server", "timestamp": "2020-07-10T22:16:37,481Z", "level": "DEBUG", "component": "o.e.d.PeerFinder", "cluster.name": "test-elk", "node.name": "test-elk2", "message": "Peer{transportAddress=<test-elk1 internal IP>:9300, discoveryNode=null, peersRequestInFlight=false} connection failed", 
    es-data_1  | {"type": "server", "timestamp": "2020-07-10T22:17:08,538Z", "level": "DEBUG", "component": "o.e.d.PeerFinder", "cluster.name": "test-elk", "node.name": "test-elk2", "message": "Peer{transportAddress=<test-elk1 internal IP>:9300, discoveryNode=null, peersRequestInFlight=false} connection failed",
    [...]

Other logs at TRACE level show that it connects and then "unregisters":

    root@test-elk2:/home/yago/elk/es-data-node# docker-compose logs | grep 0x7ed0e599
    es-data_1  | {"type": "server", "timestamp": "2020-07-10T22:39:38,368Z", "level": "TRACE", "component": "o.e.t.n.ESLoggingHandler", "cluster.name": "test-elk", "node.name": "test-elk2", "message": "[id: 0x7ed0e599] REGISTERED" }
    es-data_1  | {"type": "server", "timestamp": "2020-07-10T22:39:38,371Z", "level": "TRACE", "component": "o.e.t.n.ESLoggingHandler", "cluster.name": "test-elk", "node.name": "test-elk2", "message": "[id: 0x7ed0e599] CONNECT: 192.168.96.2/192.168.96.2:9300" }
    es-data_1  | {"type": "server", "timestamp": "2020-07-10T22:40:08,378Z", "level": "TRACE", "component": "o.e.t.n.ESLoggingHandler", "cluster.name": "test-elk", "node.name": "test-elk2", "message": "[id: 0x7ed0e599] CLOSE" }
    es-data_1  | {"type": "server", "timestamp": "2020-07-10T22:40:08,379Z", "level": "TRACE", "component": "o.e.t.n.ESLoggingHandler", "cluster.name": "test-elk", "node.name": "test-elk2", "message": "[id: 0x7ed0e599] CLOSE" }
    es-data_1  | {"type": "server", "timestamp": "2020-07-10T22:40:08,380Z", "level": "TRACE", "component": "o.e.t.n.ESLoggingHandler", "cluster.name": "test-elk", "node.name": "test-elk2", "message": "[id: 0x7ed0e599] UNREGISTERED" }

I can also see from the trace that the address it connects to is the one test-elk1 is publishing, but since that is an internal IP, it's unreachable from test-elk2:

    root@test-elk1:/home/yago/elk# docker-compose logs --tail=15 elasticsearch
    Attaching to elk_elasticsearch_1
    elasticsearch_1  | {"type": "server", "timestamp": "2020-07-10T22:02:55,528Z", "level": "INFO", "component": "o.e.d.DiscoveryModule", "cluster.name": "test-elk", "node.name": "test-elk1", "message": "using discovery type [zen] and seed hosts providers [settings]" }
    elasticsearch_1  | {"type": "server", "timestamp": "2020-07-10T22:02:58,569Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "test-elk", "node.name": "test-elk1", "message": "initialized" }
    elasticsearch_1  | {"type": "server", "timestamp": "2020-07-10T22:02:58,572Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "test-elk", "node.name": "test-elk1", "message": "starting ..." }
    elasticsearch_1  | {"type": "server", "timestamp": "2020-07-10T22:02:59,006Z", "level": "INFO", "component": "o.e.t.TransportService", "cluster.name": "test-elk", "node.name": "test-elk1", "message": "publish_address {192.168.96.2:9300}, bound_addresses {0.0.0.0:9300}" }
    elasticsearch_1  | {"type": "server", "timestamp": "2020-07-10T22:02:59,037Z", "level": "INFO", "component": "o.e.b.BootstrapChecks", "cluster.name": "test-elk", "node.name": "test-elk1", "message": "bound or publishing to a non-loopback address, enforcing bootstrap checks" }
    elasticsearch_1  | {"type": "server", "timestamp": "2020-07-10T22:02:59,141Z", "level": "INFO", "component": "o.e.c.c.Coordinator", "cluster.name": "test-elk", "node.name": "test-elk1", "message": "cluster UUID [drqPHHGvQ66kEBW2DnR7dA]" }

I can also see from the trace logs that test-elk2 manages to connect to test-elk1's real internal IP address and reads some properties from the node, including the publish_address, which it then tries to connect to, unsuccessfully. Or at least that's my guess.

My configurations are:

test-elk1 elasticsearch.yml:

    cluster.name: "test-elk"
    node.name: "test-elk1"
    network.host: "0.0.0.0"

    discovery.seed_hosts: ["test-elk1","test-elk2"]

    xpack.license.self_generated.type: trial
    xpack.security.enabled: true
    xpack.monitoring.collection.enabled: true

test-elk2 elasticsearch.yml:

    cluster.name: "test-elk"

    node:
      name: "test-elk2"
      master: true
      voting_only: false
      data: true
      ingest: false

    network.host: "0.0.0.0"
    discovery.seed_hosts: ["test-elk1"]
    logger.org.elasticsearch.discovery: TRACE
    logger.org.elasticsearch.transport: TRACE

    xpack:
      license.self_generated.type: trial
      monitoring.collection.enabled: true
      security.enabled: true

If you need more info, let me know. Also, if you have a similar setup, could you share some config tips? I've been going with plain trial and error, but nothing allows the cluster to form. My guess is that the problem is in the elasticsearch.yml files, but I can't pinpoint where.

Have you opened port 9300 on both nodes so that each is accessible from the other?

Yes, I used the same Docker configuration on both, and I set up ingress and egress firewall rules for that (on all ports, just to be sure). Both instances are reachable from each other via their GCE internal IPs.

I don't understand the difference.

Can you docker exec into each container and run:

    curl -s http://localhost:9200/_nodes/transport?pretty
    curl -s http://localhost:9200/_cat/nodes

I can only access the REST API of the node test-elk1:

    root@test-elk1:/home/yago/elk# curl -u 'elastic:changeme' http://localhost:9200/_nodes/transport?pretty
    {
      "_nodes" : {
        "total" : 1,
        "successful" : 1,
        "failed" : 0
      },
      "cluster_name" : "test-elk",
      "nodes" : {
        "hiMcEmmJRKadNEEaEue82g" : {
          "name" : "test-elk1",
          "transport_address" : "172.18.0.2:9300",
          "host" : "172.18.0.2",
          "ip" : "172.18.0.2",
          "version" : "7.4.0",
          "build_flavor" : "default",
          "build_type" : "docker",
          "build_hash" : "22e1767283e61a198cb4db791ea66e3f11ab9910",
          "roles" : [
            "ingest",
            "master",
            "data",
            "ml"
          ],
          "attributes" : {
            "ml.machine_memory" : "3883986944",
            "xpack.installed" : "true",
            "ml.max_open_jobs" : "20"
          },
          "transport" : {
            "bound_address" : [
              "0.0.0.0:9300"
            ],
            "publish_address" : "172.18.0.2:9300",
            "profiles" : { }
          }
        }
      }
    }
    root@test-elk1:/home/yago/elk# curl -u 'elastic:changeme' http://localhost:9200/_cat/nodes
    172.18.0.2 12 87 9 0.98 1.03 0.52 dilm * test-elk1

The node test-elk2 can't be queried because I can't set up the bootstrap passwords (the node never joined a cluster). This is what I get:

    [root@f5dc9d4581f6 elasticsearch]# bin/elasticsearch-setup-passwords auto

    Failed to determine the health of the cluster running at http://172.20.0.2:9200
    Unexpected response code [503] from calling GET http://172.20.0.2:9200/_cluster/health?pretty
    Cause: master_not_discovered_exception

    It is recommended that you resolve the issues with your cluster before running elasticsearch-setup-passwords.
    It is very likely that the password changes will fail when run against an unhealthy cluster.

    Do you want to continue with the password setup process [y/N]y

    Initiating the setup of passwords for reserved users elastic,apm_system,kibana,logstash_system,beats_system,remote_monitoring_user.
    The passwords will be randomly generated and printed to the console.
    Please confirm that you would like to continue [y/N]y

    Unexpected response code [503] from calling PUT http://172.20.0.2:9200/_security/user/apm_system/_password?pretty
    Cause: Cluster state has not been recovered yet, cannot write to the [null] index

    Possible next steps:
    * Try running this tool again.
    * Try running with the --verbose parameter for additional messages.
    * Check the elasticsearch logs for additional error details.
    * Use the change password API manually.

    ERROR: Failed to set password for user [apm_system].

I can set up the bootstrap password via the keystore, but it doesn't seem to work:

    [root@f5dc9d4581f6 elasticsearch]# bin/elasticsearch-keystore add "bootstrap.password"
    Setting bootstrap.password already exists. Overwrite? [y/N]y
    Enter value for bootstrap.password: 
    [root@f5dc9d4581f6 elasticsearch]# curl -u 'elastic:changeme' http://172.20.0.2:9200/_security/_authenticate?pretty
    {
      "error" : {
        "root_cause" : [
          {
            "type" : "security_exception",
            "reason" : "failed to authenticate user [elastic]",
            "header" : {
              "WWW-Authenticate" : "Basic realm=\"security\" charset=\"UTF-8\""
            }
          }
        ],
        "type" : "security_exception",
        "reason" : "failed to authenticate user [elastic]",
        "header" : {
          "WWW-Authenticate" : "Basic realm=\"security\" charset=\"UTF-8\""
        }
      },
      "status" : 401
    }
    [root@f5dc9d4581f6 elasticsearch]#

@yagodorea
This is my understanding of your setup:

  1. Two hosts, each running its own Docker daemon: test-elk1 (192.168.96.2) and test-elk2 (192.168.96.?)
  2. On test-elk1, the ES transport is bound to and publishes 172.18.0.2:9300; this will not be reachable from test-elk2.
  3. On both hosts, ports 9200 and 9300 are reachable from the other.
  4. test-elk1 resolves to the instance's internal IP, 192.168.96.2.

You probably forwarded port 9200 on the test-elk1 instance to the container's 9200, which is why curl http://test-elk1:9200 works from test-elk2.

I have not tried this kind of setup before, but this may work:

  1. In both elasticsearch.yml files, remove network.host: 0.0.0.0 and add network.bind_host: [_site_, _local_] plus network.publish_host: <the node's name (test-elk1) or the instance's internal IP (192.168.96.2)>.
  2. Forward host port 9300 to the container's 9300 on both instances (docker run -p 9300:9300).
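
Putting those two steps together, a minimal sketch of the resulting elasticsearch.yml for test-elk1 (the publish host value is an assumption based on the IPs mentioned earlier in this thread; mirror the change on test-elk2 with its own name/IP):

```yaml
cluster.name: "test-elk"
node.name: "test-elk1"
# Bind inside the container on site-local and loopback addresses
network.bind_host: [_site_, _local_]
# Publish an address the other VM can actually reach: the GCE instance's
# internal IP (or a hostname resolving to it), not the Docker bridge IP
network.publish_host: "192.168.96.2"
discovery.seed_hosts: ["test-elk1", "test-elk2"]
```

With host port 9300 forwarded to the container (step 2), the published <internal IP>:9300 then lands on the container's transport port when the other instance connects.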

I would recommend getting the setup working without X-Pack security first; that reduces the number of variables. Once the cluster has formed, you can enable security.

Thanks, Vinayak! It worked that way. I think the first node was bound to the instance's network but was publishing an internal address; changing it manually made it work. Now it's publishing the instance's local IP address:

"transport" : {
        "bound_address" : [
          "127.0.0.1:9300",
          "172.20.0.2:9300"
        ],
        "publish_address" : "<instance's local IP>:9300",
        "profiles" : { }
      }
