Node does not automatically rejoin the Elasticsearch cluster

Good afternoon! I have a cluster of five nodes: three are master-eligible (node.master: true) and all five are data nodes (node.data: true). There is a problem when testing it: if I shut one node down and start it again, its Elasticsearch instance does not rejoin the cluster.
Only a simultaneous restart of the Elasticsearch service on all nodes helps.

Here is the node config:

cluster.name: lynx
node.name: phd-1
node.master: true
node.data: true
network.bind_host: 0.0.0.0
network.host: 10.10.10.15
http.port: 9200
discovery.zen.ping.unicast.hosts: [“10.10.10.11”, “10.10.10.12”, “10.10.10.13”, “10.10.10.14”, “10.10.10.15”]
discovery.zen.minimum_master_nodes: 2

Please tell me what I missed and how I can fix this problem.

Can you please provide the output of the _cat/nodes API?

If I run this command on the node: curl -XGET 'http://localhost:9200/_cluster/health'
I see this error:

{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}               ],"type":"master_not_discovered_exception","reason":null},"status":503}root@phd- 

What about the other nodes? Which version of Elasticsearch are you using?

I hope I am doing it right:

curl http://localhost:9200/_cat/nodes

The error is the same as before:

{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"maste

When I run the command curl -XGET 'http://localhost:9200/_cluster/health' on the other nodes, I get:

{"cluster_name":"lynx_new","status":"green","timed_out":false,"number_of_nodes":4,"number_of_data_nodes":4,"active_primary_shards":0,"active_shards":0,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}root@poi-u1:/opt

Product version: 7.7.2

If you are using Elasticsearch 7.7, these settings are not the correct ones, as cluster coordination changed in Elasticsearch 7.x. I suspect you may need to set up the cluster again, and when you do so, make sure you are looking at the docs for the correct version.
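For reference, a minimal sketch of the 7.x-style settings that replace the zen discovery options above (the host list is taken from your config; the extra node names are just placeholders, not a tested configuration):

discovery.seed_hosts: ["10.10.10.11", "10.10.10.12", "10.10.10.13", "10.10.10.14", "10.10.10.15"]
cluster.initial_master_nodes: ["phd-1", "master-node-2", "master-node-3"]   # node.name values of the master-eligible nodes, only needed when bootstrapping a brand-new cluster

In 7.x you no longer set discovery.zen.minimum_master_nodes; the cluster manages the voting quorum itself.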


Thanks!

It turns out I gave you the wrong product version. The actual version is:

"version": {
     "number": "6.8.6"

Full output of the command curl -XGET 'localhost:9200':

{
  "name" : "phd-1",
  "cluster_name" : "lynx_new",
  "cluster_uuid" : "_na_",
  "version" : {
    "number" : "6.8.6",
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "3d9f765",
    "build_date" : "2019-12-13T17:11:52.013738Z",
    "build_snapshot" : false,
    "lucene_version" : "7.7.2",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

It turns out the problem persists. I even disabled the firewall completely during testing.
I apologize for initially getting the version wrong!

curl http://localhost:9200/_cat/nodes:

10.10.10.12 33 67 3 0.28 0.39 0.36 di  - psis-1
10.10.10.13 32 52 2 0.31 0.37 0.36 mdi - poi-1
10.10.10.14 31 43 3 0.29 0.56 0.47 di  - pdi-1
10.10.10.11 33 73 3 0.32 1.85 1.55 mdi * pfur-1

Are there any ideas on how to fix this problem?

I see two different cluster names. Is that due to you changing it at some point or does it differ between the nodes?

Yes, I changed the cluster name on all nodes. I double-checked; all nodes now have cluster.name: lynx_new.
When I first assembled the cluster it was named lynx. I rebuilt it with the new name without changing any other parameters.
From the problem node, I checked access to one of the master nodes in the cluster:

telnet 10.10.10.13 9300
Trying 10.10.10.13...
Connected to 10.10.10.13.
Escape character is '^]'.

Can it reach all the other nodes the same way? Can the other nodes reach this node?
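For example (assuming netcat is installed; the IP list is taken from your unicast hosts setting), something like this run on each node would check the transport port in both directions:

for ip in 10.10.10.11 10.10.10.12 10.10.10.13 10.10.10.14 10.10.10.15; do
  # -z only probes the port, -w 2 gives a two-second timeout
  nc -z -w 2 "$ip" 9300 && echo "$ip:9300 reachable" || echo "$ip:9300 NOT reachable"
done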

Yes, this host can reach the other nodes, and the other nodes can reach this host.
I looked at the cluster log file /var/log/elasticsearch/lynx_new.log
I am confused by these errors:

[phd-1] failed to resolve host [“10.10.10.11”]

UnicastZenPing   ] [phd-1] failed to resolve host [“10.10.10.12”]

[o.e.d.z.UnicastZenPing   ] [phd-1] failed to resolve host [“10.10.10.12”]
java.net.UnknownHostException: “10.10.10.12”

Perhaps this entry is incorrect:

discovery.zen.ping.unicast.hosts: [“10.10.10.11”, “10.10.10.12”, “10.10.10.13”, “10.10.10.14”, “10.10.10.15”]

Yes, that seems to be it. This entry:

[“10.10.10.11”, “10.10.10.12”, “10.10.10.13”, “10.10.10.14”, “10.10.10.15”] 

was WRONG.
I changed it to:

['10.10.10.11', '10.10.10.12', '10.10.10.13', '10.10.10.14', '10.10.10.15']

And now it works.

Yes, the mistake was that you were using “ (LEFT DOUBLE QUOTATION MARK) and ” (RIGHT DOUBLE QUOTATION MARK) rather than " (QUOTATION MARK) around the IP addresses.
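For reference, the same line written with plain ASCII double quotes (plain single or double quotes both work in YAML) would be:

discovery.zen.ping.unicast.hosts: ["10.10.10.11", "10.10.10.12", "10.10.10.13", "10.10.10.14", "10.10.10.15"]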


I want to know whether the following behavior is normal: when I shut down two of the three master-eligible nodes, the cluster falls apart. I believed the cluster would keep working with a single master node.
I disabled two of the three master-eligible nodes and one data-only node (all five nodes are data nodes). On the two remaining nodes I get this error:

{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}root@pdi-

After executing the command:

curl -XGET 'http://localhost:9200/_cluster/health'

You always need a majority of master-eligible nodes present for a healthy cluster, so what you are seeing is expected.
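For a cluster with three master-eligible nodes the quorum works out to:

minimum_master_nodes = floor(3 / 2) + 1 = 2

so at least two of the three master-eligible nodes must be running for a master to be elected. With only one left, the remaining nodes cannot elect a master, which is exactly the master_not_discovered_exception you are seeing.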


Yes, I understand that now, since I specified discovery.zen.minimum_master_nodes: 2.
Based on this, the cluster requires at least two master-eligible nodes to be present.

Thank you very much!
