Elasticsearch adds and removes nodes frequently

Can you explain your topology a little more?
Are all nodes in the same location?

@warkolm Sure! Here is the elasticsearch.yml from each of the machines:

ES-01

    cluster.name: india_farmers
    cluster.routing.allocation.awareness.attributes: zone
    cluster.routing.allocation.awareness.force.zone.values: zone1,zone2
    node.name: elastic-01
    node.attr.zone: zone1
    node.master: true
    node.data: true
    node.ingest: true
    path.data: /var/lib/elasticsearch
    path.logs: /var/log/elasticsearch
    network.host: 10.0.87.207
    discovery.seed_hosts:
       - 10.0.87.207:9300
       - 10.0.87.208:9300
       - 10.0.87.209:9300
       - 10.0.87.210:9300
       - 10.0.225.232:9300
    cluster.initial_master_nodes:
       - 10.0.87.207:9300
       - 10.0.87.209:9300
       - 10.0.225.232:9300
    xpack.security.enabled: false
    path.repo: ["/opt/elasticsearch-backup"]

ES-02

    cluster.name: india_farmers
    node.name: elastic-02
    node.attr.zone: zone1
    node.master: false
    node.data: true
    node.ingest: true
    path.data: /var/lib/elasticsearch
    path.logs: /var/log/elasticsearch
    network.host: 10.0.87.208
    discovery.seed_hosts:
       - 10.0.87.207:9300
       - 10.0.87.208:9300
       - 10.0.87.209:9300
       - 10.0.87.210:9300
       - 10.0.225.232:9300
    cluster.initial_master_nodes:
       - 10.0.87.207:9300
       - 10.0.87.209:9300
       - 10.0.225.232:9300
    xpack.security.enabled: false
    path.repo: ["/opt/elasticsearch-backup"]

ES-03

    cluster.name: india_farmers
    cluster.routing.allocation.awareness.attributes: zone
    cluster.routing.allocation.awareness.force.zone.values: zone1,zone2
    node.name: elastic-03
    node.attr.zone: zone2
    node.master: true
    node.data: true
    node.ingest: true
    path.data: /var/lib/elasticsearch
    path.logs: /var/log/elasticsearch
    network.host: 10.0.87.209
    discovery.seed_hosts:
       - 10.0.87.207:9300
       - 10.0.87.208:9300
       - 10.0.87.209:9300
       - 10.0.87.210:9300
       - 10.0.225.232:9300
    cluster.initial_master_nodes:
       - 10.0.87.207:9300
       - 10.0.87.209:9300
       - 10.0.225.232:9300
    xpack.security.enabled: false
    path.repo: ["/opt/elasticsearch-backup"]

ES-04

    cluster.name: india_farmers
    node.name: elastic-04
    node.attr.zone: zone2
    node.master: false
    node.data: true
    node.ingest: true
    path.data: /var/lib/elasticsearch
    path.logs: /var/log/elasticsearch
    network.host: 10.0.87.210
    discovery.seed_hosts:
       - 10.0.87.207:9300
       - 10.0.87.208:9300
       - 10.0.87.209:9300
       - 10.0.87.210:9300
       - 10.0.225.232:9300
    cluster.initial_master_nodes:
       - 10.0.87.207:9300
       - 10.0.87.209:9300
       - 10.0.225.232:9300
    xpack.security.enabled: false
    path.repo: ["/opt/elasticsearch-backup"]

ES-05

    cluster.name: india_farmers
    node.name: elastic-05
    node.master: true
    cluster.routing.allocation.awareness.attributes: zone
    cluster.routing.allocation.awareness.force.zone.values: zone1,zone2
    node.data: false
    node.ingest: true
    path.data: /var/lib/elasticsearch
    path.logs: /var/log/elasticsearch
    network.host: 10.0.225.232
    discovery.seed_hosts:
       - 10.0.87.207:9300
       - 10.0.87.208:9300
       - 10.0.87.209:9300
       - 10.0.87.210:9300
       - 10.0.225.232:9300
    cluster.initial_master_nodes:
       - 10.0.87.207:9300
       - 10.0.87.209:9300
       - 10.0.225.232:9300
    xpack.security.enabled: false
    path.repo: ["/opt/elasticsearch-backup"]

And here is my physical layout:

Datacenter1
Elastic-01
Elastic-02

Datacenter2
Elastic-03
Elastic-04

Datacenter3
Elastic-05

All these datacenters are in the same city, and here are the ping results between nodes in different DCs.

From Elastic-01 (Datacenter1) to Elastic-03 (Datacenter2):

PING 10.0.87.209 (10.0.87.209) 56(84) bytes of data.
64 bytes from 10.0.87.209: icmp_seq=1 ttl=64 time=0.235 ms
64 bytes from 10.0.87.209: icmp_seq=2 ttl=64 time=0.217 ms
64 bytes from 10.0.87.209: icmp_seq=3 ttl=64 time=0.255 ms
64 bytes from 10.0.87.209: icmp_seq=4 ttl=64 time=0.251 ms
64 bytes from 10.0.87.209: icmp_seq=5 ttl=64 time=0.262 ms
64 bytes from 10.0.87.209: icmp_seq=6 ttl=64 time=0.261 ms
64 bytes from 10.0.87.209: icmp_seq=7 ttl=64 time=0.218 ms
64 bytes from 10.0.87.209: icmp_seq=8 ttl=64 time=0.229 ms
64 bytes from 10.0.87.209: icmp_seq=9 ttl=64 time=0.264 ms
64 bytes from 10.0.87.209: icmp_seq=10 ttl=64 time=0.218 ms

From Elastic-01 (Datacenter1) to Elastic-05 (Datacenter3):

ping -c 10 10.0.225.232
PING 10.0.225.232 (10.0.225.232) 56(84) bytes of data.
64 bytes from 10.0.225.232: icmp_seq=1 ttl=61 time=1.59 ms
64 bytes from 10.0.225.232: icmp_seq=2 ttl=61 time=1.31 ms
64 bytes from 10.0.225.232: icmp_seq=3 ttl=61 time=1.27 ms
64 bytes from 10.0.225.232: icmp_seq=4 ttl=61 time=1.45 ms
64 bytes from 10.0.225.232: icmp_seq=5 ttl=61 time=1.29 ms
64 bytes from 10.0.225.232: icmp_seq=6 ttl=61 time=1.24 ms
64 bytes from 10.0.225.232: icmp_seq=7 ttl=61 time=1.18 ms
64 bytes from 10.0.225.232: icmp_seq=8 ttl=61 time=1.54 ms
64 bytes from 10.0.225.232: icmp_seq=9 ttl=61 time=1.19 ms
64 bytes from 10.0.225.232: icmp_seq=10 ttl=61 time=1.13 ms
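
Since plain ICMP ping doesn't exercise the transport port, here is a reachability check I can run as well (a simple sketch, assuming nc is installed on the hosts; 9300 is the transport port from the configs above):

    # One-shot TCP connectivity check from elastic-01 to elastic-05's transport port
    nc -zv 10.0.225.232 9300
    # And the reverse direction, from elastic-05 back to elastic-03
    nc -zv 10.0.87.209 9300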

The elected master will be logging a message containing the string node-left when the node leaves. Can you share all copies of this message, so we can see the pattern over time?
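
For example, something along these lines on the current master should collect all of them (assuming the default log file, which is named after the cluster):

    # Pull every node-left / node-join event from the elected master's log
    grep -E 'node-(left|join)' /var/log/elasticsearch/india_farmers.log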

You want to see logs from the current master node, right?

Yes.

@DavidTurner Here is the log from the master node

[2021-02-03T01:45:31,310][INFO ][o.e.c.s.ClusterApplierService] [elastic-03] removed {{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true},}, term: 3, version: 12993, reason: Publication{term=3, version=12993}
[2021-02-03T01:45:33,538][INFO ][o.e.c.s.MasterService    ] [elastic-03] node-join[{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true} join existing leader], term: 3, version: 12994, reason: added {{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true},}
[2021-02-03T01:45:33,563][INFO ][o.e.c.s.ClusterApplierService] [elastic-03] added {{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true},}, term: 3, version: 12994, reason: Publication{term=3, version=12994}
[2021-02-03T01:50:34,174][INFO ][o.e.c.s.MasterService    ] [elastic-03] node-left[{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true} disconnected], term: 3, version: 12996, reason: removed {{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true},}
[2021-02-03T01:50:34,184][INFO ][o.e.c.s.ClusterApplierService] [elastic-03] removed {{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true},}, term: 3, version: 12996, reason: Publication{term=3, version=12996}
[2021-02-03T01:50:36,626][INFO ][o.e.c.s.MasterService    ] [elastic-03] node-join[{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true} join existing leader], term: 3, version: 12997, reason: added {{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true},}
[2021-02-03T01:50:36,648][INFO ][o.e.c.s.ClusterApplierService] [elastic-03] added {{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true},}, term: 3, version: 12997, reason: Publication{term=3, version=12997}
[2021-02-03T01:54:06,640][INFO ][o.e.c.r.a.DiskThresholdMonitor] [elastic-03] skipping monitor as a check is already in progress
[2021-02-03T01:55:36,540][INFO ][o.e.c.s.MasterService    ] [elastic-03] node-left[{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true} disconnected], term: 3, version: 12999, reason: removed {{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true},}
[2021-02-03T01:55:36,550][INFO ][o.e.c.s.ClusterApplierService] [elastic-03] removed {{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true},}, term: 3, version: 12999, reason: Publication{term=3, version=12999}
[2021-02-03T01:55:38,793][INFO ][o.e.c.s.MasterService    ] [elastic-03] node-join[{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true} join existing leader], term: 3, version: 13000, reason: added {{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true},}
[2021-02-03T01:55:38,815][INFO ][o.e.c.s.ClusterApplierService] [elastic-03] added {{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true},}, term: 3, version: 13000, reason: Publication{term=3, version=13000}
[2021-02-03T02:00:38,161][INFO ][o.e.c.s.MasterService    ] [elastic-03] node-left[{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true} disconnected], term: 3, version: 13002, reason: removed {{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true},}
[2021-02-03T02:00:38,174][INFO ][o.e.c.s.ClusterApplierService] [elastic-03] removed {{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true},}, term: 3, version: 13002, reason: Publication{term=3, version=13002}
[2021-02-03T02:00:40,270][INFO ][o.e.c.s.MasterService    ] [elastic-03] node-join[{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true} join existing leader], term: 3, version: 13003, reason: added {{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true},}
[2021-02-03T02:00:40,297][INFO ][o.e.c.s.ClusterApplierService] [elastic-03] added {{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true},}, term: 3, version: 13003, reason: Publication{term=3, version=13003}
[2021-02-03T02:05:40,147][INFO ][o.e.c.s.MasterService    ] [elastic-03] node-left[{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true} disconnected], term: 3, version: 13005, reason: removed {{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true},}
[2021-02-03T02:05:40,158][INFO ][o.e.c.s.ClusterApplierService] [elastic-03] removed {{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true},}, term: 3, version: 13005, reason: Publication{term=3, version=13005}
[2021-02-03T02:05:42,839][INFO ][o.e.c.s.MasterService    ] [elastic-03] node-join[{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true} join existing leader], term: 3, version: 13006, reason: added {{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true},}
[2021-02-03T02:05:42,864][INFO ][o.e.c.s.ClusterApplierService] [elastic-03] added {{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true},}, term: 3, version: 13006, reason: Publication{term=3, version=13006}
[2021-02-03T02:10:42,103][INFO ][o.e.c.s.MasterService    ] [elastic-03] node-left[{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true} disconnected], term: 3, version: 13008, reason: removed {{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true},}
[2021-02-03T02:10:42,115][INFO ][o.e.c.s.ClusterApplierService] [elastic-03] removed {{elastic-05}{Uhj07HVbQQ2IEno38VzI9A}{GRcJlXQiTq6tqJMZ9E_REw}{10.0.225.232}{10.0.225.232:9300}{ilm}{ml.machine_memory=16656920576, ml.max_open_jobs=20, xpack.installed=true},}, term: 3, version: 13008, reason: Publication{term=3, version=13008}

disconnected means that a TCP connection was dropped. Since it happens every 5 minutes, it's probably something on the network between the nodes with a 5-minute idle timeout. See these docs for more information.
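
A common workaround for that situation (assuming the culprit really is a firewall or NAT device between the datacenters dropping idle connections) is to have the kernel send TCP keepalives well before the 5-minute mark on every Elasticsearch host; Elasticsearch enables SO_KEEPALIVE on its transport connections by default, so the kernel settings are what matter:

    # Start keepalive probes after 2 minutes of idle time instead of the 2-hour default,
    # so intermediate devices never see the transport connections as idle.
    sysctl -w net.ipv4.tcp_keepalive_time=120
    sysctl -w net.ipv4.tcp_keepalive_intvl=30    # then probe every 30 seconds
    sysctl -w net.ipv4.tcp_keepalive_probes=4    # give up after 4 failed probes
    # Persist the same keys in /etc/sysctl.d/ so they survive a reboot.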
