Unstable after starting nodes

Hi,

ES: v7.3.2

I created a brand new cluster with 3 master nodes, no problem, all running fine. But each time I add a (non-master) node, after some time (~1h) the node disappears, monitoring shows the node was removed, and the cluster becomes very slow and unstable. Then I stop the node and everything comes back fine.

I tried this with several other nodes in different VLANs, same result.

Is there a trick in the discovery part?

Here's my master config; for the other nodes I just set node.master: false

```yaml
cluster.name: blabla
node.name: blabla-1
path.data: /opt/elasticsearch/data
path.logs: /var/log/elasticsearch
#bootstrap.memory_lock: true
bootstrap.system_call_filter: false
network.host: _site_,_local_
http.port: 9200
discovery.seed_hosts: ["blabla-1", "blabla-2", "blabla-3"]
cluster.initial_master_nodes: ["blabla-1", "blabla-2", "blabla-3"]
node.master: true
node.data: true
```

The error it gives when I add a node is:

```
master node changed {previous [{blabla-2}{XCPed0npS5m2Sub0AfqTQw}{38O_1NF2RWSFi8lB0eV3QA}{10.30.172.196}{10.30.172.196:9300}{dim}{ml.machine_memory=67368509440, ml.max_open_jobs=20, xpack.installed=true}], current []}, term: 5, version: 1182, reason: becoming candidate: onLeaderFailure

Caused by: org.elasticsearch.transport.RemoteTransportException: [blabla-2][10.30.172.196:9300][internal:coordination/fault_detection/leader_check]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: leader check from unknown node
```

For discovery.seed_hosts, do I have to enter all nodes, including both master and non-master nodes?

No one on 7.3.2?

This means the node was already removed from the cluster. There will be an earlier message saying why.

No, that setting should only mention master-eligible nodes. It doesn't sound like a discovery problem, however, because the node must have joined the cluster to get the message you are seeing.
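For reference, a non-master node config consistent with that advice might look like the sketch below. This is an assumption-laden illustration, not the poster's actual config: the node name is hypothetical, and only the three master-eligible hosts from the original post are listed as seed hosts.

```yaml
# Sketch of a 7.x non-master node config (node name is hypothetical)
cluster.name: blabla
node.name: blabla-data-1
node.master: false
node.data: true
network.host: _site_
# Only the master-eligible nodes belong here:
discovery.seed_hosts: ["blabla-1", "blabla-2", "blabla-3"]
# Note: cluster.initial_master_nodes should not be set on non-master nodes,
# and should be removed everywhere once the cluster has formed.
```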

OK, thanks.

I tried several setups, including making all nodes master-eligible, but still the same problem.

After around 2h all monitoring fails on every node, and the Kibana graphs show a cut on every node.

The log says:

```
[2019-09-22T11:40:59,739][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [blabla-1] no known master node, scheduling a retry
[2019-09-22T11:40:59,747][DEBUG][o.e.a.a.c.s.TransportClusterStateAction] [blabla-1] timed out while retrying [cluster:monitor/state] after failure (timeout [30s])
[2019-09-22T11:40:59,747][WARN ][r.suppressed             ] [blabla-1] path: /_cluster/settings, params: {include_defaults=true}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
```

I have exactly the same setup in test and it has been running for 2 months without any errors.

Every time it takes 1h30-2h and then everything fails. I'm lost.

This node can't find the elected master node. I expect there are other log messages saying why.
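As a quick check while this is happening, you can ask a node which master it currently recognises via the `_cat` API (shown here in Kibana Dev Tools syntax; the same request works with curl against port 9200). An empty response or a timeout is consistent with the MasterNotDiscoveredException above.

```
GET _cat/master?v
```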

I have just updated from 6.8 to 7.3.2 and am having the same issue.
No problems with hardware or network, but a node is periodically lost from the cluster.
In the logs:

```
Caused by: org.elasticsearch.transport.RemoteTransportException: [node-02][10.1.3.112:9300][internal:coordination/fault_detection/leader_check]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: leader check from unknown node
```

As I said above, `leader check from unknown node` means the node was already removed from the cluster and there will be an earlier message saying why. Look for the string `node-left`.
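Concretely, that search is a grep over the master node's log (the real file lives under your path.logs setting, e.g. /var/log/elasticsearch/&lt;cluster_name&gt;.log). The sketch below writes a sample file first so it is self-contained; the log lines are abbreviated stand-ins modeled on the messages quoted in this thread.

```shell
# Build a sample log file so this sketch runs anywhere
# (replace /tmp/es-sample.log with your real log file)
cat > /tmp/es-sample.log <<'EOF'
[2019-09-26T11:55:27,652][INFO ][o.e.c.s.MasterService ] [node-02] node-left[{node-00}{...} lagging], term: 3, version: 2093
[2019-09-26T11:56:02,110][INFO ][o.e.c.s.MasterService ] [node-02] node-join[{node-00}{...}], term: 3, version: 2101
EOF

# The FIRST node-left entry is the one that explains why the node was removed
grep -m1 "node-left" /tmp/es-sample.log
```

The `-m1` stops at the first match, since later node-left/node-join churn is usually just fallout from the original removal.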

Grep logs for "node-left":

Thanks. The first one is the one we're after:

```
[2019-09-26T11:55:27,652][INFO ][o.e.c.s.MasterService ] [node-02] node-left[{node-00}{A9AdXLb5QA-ZMcicCn26OQ}{S9uJF_S1QNC1SsgTqmY3bQ}{10.1.3.110}{10.1.3.110:9300}{dim}{ml.machine_memory=134928560128, ml.max_open_jobs=20, xpack.installed=true} lagging], term: 3, version: 2093, reason: removed {{node-00}{A9AdXLb5QA-ZMcicCn26OQ}{S9uJF_S1QNC1SsgTqmY3bQ}{10.1.3.110}{10.1.3.110:9300}{dim}{ml.machine_memory=134928560128, ml.max_open_jobs=20, xpack.installed=true},}
```

Note the `lagging`: this means the node took more than 2 minutes to process a cluster state update, which Elasticsearch 7.x treats as a failure of that node.

The logging around this is improved in 7.4 (and even more in 7.5), but on 7.3 I suggest you add:

```yaml
logger.org.elasticsearch.gateway.MetaStateService: TRACE
logger.org.elasticsearch.cluster.service: TRACE
```

This will give much more detail about why cluster state updates are being processed so slowly.
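If editing the static config and restarting is inconvenient, the same logger levels can also be changed at runtime through the cluster settings API (Dev Tools syntax shown below; the request body also works with curl). Set them back to null afterwards, as TRACE logging is verbose.

```
PUT _cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.gateway.MetaStateService": "TRACE",
    "logger.org.elasticsearch.cluster.service": "TRACE"
  }
}
```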

This log line is not from TRACE logging, but it does contain the reason.