Unstable after starting nodes

Hi,

ES: v7.3.2

I created a brand new cluster with 3 master nodes, no problem, all running fine. But each time I add a (non-master) node, after some time (~1h) the node disappears, monitoring shows the node was removed, and the cluster becomes very slow and unstable. Then I stop the node and everything comes back fine.

I tried this with several other nodes in different VLANs, same result.

Is there a trick in the discovery part?

Here's my master config; for the other nodes I just set node.master: false

```yaml
cluster.name: blabla
node.name: blabla-1
path.data: /opt/elasticsearch/data
path.logs: /var/log/elasticsearch
#bootstrap.memory_lock: true
bootstrap.system_call_filter: false
network.host: _site_,_local_
http.port: 9200
discovery.seed_hosts: ["blabla-1", "blabla-2", "blabla-3"]
cluster.initial_master_nodes: ["blabla-1", "blabla-2", "blabla-3"]
node.master: true
node.data: true
```

The error it gives when I add a node is:

```
master node changed {previous [{blabla-2}{XCPed0npS5m2Sub0AfqTQw}{38O_1NF2RWSFi8lB0eV3QA}{10.30.172.196}{10.30.172.196:9300}{dim}{ml.machine_memory=67368509440, ml.max_open_jobs=20, xpack.installed=true}], current []}, term: 5, version: 1182, reason: becoming candidate: onLeaderFailure

Caused by: org.elasticsearch.transport.RemoteTransportException: [blabla-2][10.30.172.196:9300][internal:coordination/fault_detection/leader_check]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: leader check from unknown node
```

For discovery.seed_hosts, do I have to enter all nodes, including both master and non-master nodes?

No one on 7.3.2?

This means the node was already removed from the cluster. There will be an earlier message saying why.

No, that setting should only mention master-eligible nodes. It doesn't sound like a discovery problem, however, because the node must have joined the cluster to get the message you are seeing.
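For reference, a non-master node config consistent with that advice might look like the sketch below. This is an assumption-laden illustration, not the poster's actual config: the node name is hypothetical, and only the three master-eligible hosts from the original post are listed as seed hosts.

```yaml
# Sketch of a 7.x non-master node config (node name is hypothetical)
cluster.name: blabla
node.name: blabla-data-1
node.master: false
node.data: true
network.host: _site_
# Only the master-eligible nodes belong here:
discovery.seed_hosts: ["blabla-1", "blabla-2", "blabla-3"]
# Note: cluster.initial_master_nodes should not be set on non-master nodes,
# and should be removed everywhere once the cluster has formed.
```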

OK, thanks.

I tried several setups, including making all nodes master-eligible, but still the same problem.

After around 2h all monitoring fails on every node, and the Kibana graphs show a cut on every node.

The log says:

```
[2019-09-22T11:40:59,739][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [blabla-1] no known master node, scheduling a retry
[2019-09-22T11:40:59,747][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [blabla-1] timed out while retrying [cluster:monitor/state] after failure (timeout [30s])
[2019-09-22T11:40:59,747][WARN ][r.suppressed             ] [blabla-1] path: /_cluster/settings, params: {include_defaults=true}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
```

I have exactly the same setup in test and it has been running for 2 months without any errors.

Every time it takes 1h30-2h and then everything fails. I'm lost.

This node can't find the elected master node. I expect there are other log messages saying why.
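As a quick check while this is happening, you can ask a node which master it currently recognises via the `_cat` API (shown here in Kibana Dev Tools syntax; the same request works with curl against port 9200). An empty response or a timeout is consistent with the MasterNotDiscoveredException above.

```
GET _cat/master?v
```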

I have just updated from 6.8 to 7.3.2 and am having the same issue.
No problems with hardware or network, but a node is periodically lost from the cluster.
In the logs:

```
Caused by: org.elasticsearch.transport.RemoteTransportException: [node-02][10.1.3.112:9300][internal:coordination/fault_detection/leader_check]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: leader check from unknown node
```

As I said above, `leader check from unknown node` means the node was already removed from the cluster and there will be an earlier message saying why. Look for the string `node-left`.
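Concretely, that search is a grep over the master node's log (the real file lives under your path.logs setting, e.g. /var/log/elasticsearch/&lt;cluster_name&gt;.log). The sketch below writes a sample file first so it is self-contained; the log lines are abbreviated stand-ins modeled on the messages quoted in this thread.

```shell
# Build a sample log file so this sketch runs anywhere
# (replace /tmp/es-sample.log with your real log file)
cat > /tmp/es-sample.log <<'EOF'
[2019-09-26T11:55:27,652][INFO ][o.e.c.s.MasterService ] [node-02] node-left[{node-00}{...} lagging], term: 3, version: 2093
[2019-09-26T11:56:02,110][INFO ][o.e.c.s.MasterService ] [node-02] node-join[{node-00}{...}], term: 3, version: 2101
EOF

# The FIRST node-left entry is the one that explains why the node was removed
grep -m1 "node-left" /tmp/es-sample.log
```

The `-m1` stops at the first match, since later node-left/node-join churn is usually just fallout from the original removal.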

Grep logs for "node-left":

Thanks. The first one is the one we're after:

```
[2019-09-26T11:55:27,652][INFO ][o.e.c.s.MasterService ] [node-02] node-left[{node-00}{A9AdXLb5QA-ZMcicCn26OQ}{S9uJF_S1QNC1SsgTqmY3bQ}{10.1.3.110}{10.1.3.110:9300}{dim}{ml.machine_memory=134928560128, ml.max_open_jobs=20, xpack.installed=true} lagging], term: 3, version: 2093, reason: removed {{node-00}{A9AdXLb5QA-ZMcicCn26OQ}{S9uJF_S1QNC1SsgTqmY3bQ}{10.1.3.110}{10.1.3.110:9300}{dim}{ml.machine_memory=134928560128, ml.max_open_jobs=20, xpack.installed=true},}
```

Note the `lagging`: this means the node took more than 2 minutes to process a cluster state update, which Elasticsearch 7.x treats as a failure of that node.

The logging around this is improved in 7.4 (and even more in 7.5), but on 7.3 I suggest you add:

```yaml
logger.org.elasticsearch.gateway.MetaStateService: TRACE
logger.org.elasticsearch.cluster.service: TRACE
```

This will give much more detail about why cluster state updates are being processed so slowly.
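If editing the static config and restarting is inconvenient, the same logger levels can also be changed at runtime through the cluster settings API (Dev Tools syntax shown below; the request body also works with curl). Set them back to null afterwards, as TRACE logging is verbose.

```
PUT _cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.gateway.MetaStateService": "TRACE",
    "logger.org.elasticsearch.cluster.service": "TRACE"
  }
}
```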

This log line is not from TRACE logging, but it does contain the reason.