Periodic disconnection of same data nodes

alissan · July 12, 2023, 9:29am

Hello,

I have a cluster with 3 master nodes and 40 data nodes (d1, d2, ..., d40).
The first 5 data nodes also have the voting-only master role.

Only the following data nodes show this periodic abnormal behavior:

d11,d12,d13,d14,d15,d16,d17,d21,d22

These nodes disconnect from the cluster every ~65 minutes and rejoin after ~10 minutes.

Chart: data node count over time (Jul 1-11).
High values show the expected number of data nodes (40); low values show the count after the nodes disconnect (31).

Chart: minutes between successive changes in the node count (Jul 1-11).
High values (~65 minutes) show how long the nodes stay connected; low values (~11 minutes) show how long they stay disconnected.
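The second chart can be derived from the per-minute samples by measuring how long each node count persists. A minimal sketch, assuming the samples have already been loaded as sorted `(datetime, count)` pairs; the real format of `number_of_data_nodes_2023.07.log` is not shown in the thread, and the data below is synthetic, not taken from the cluster:

```python
from datetime import datetime, timedelta


def change_intervals(samples):
    """Return (count, minutes) for each completed run of an unchanged node count.

    samples: list of (datetime, int) pairs sorted by time. The final run is
    still open at the end of the data, so it is not included.
    """
    if not samples:
        return []
    runs = []
    start_time, current = samples[0]
    for ts, count in samples[1:]:
        if count != current:
            minutes = (ts - start_time).total_seconds() / 60
            runs.append((current, minutes))
            start_time, current = ts, count
    return runs


# Synthetic example: 65 minutes at 40 nodes, 10 at 31, 65 at 40, then a drop.
t0 = datetime(2023, 7, 1)
pattern = [40] * 65 + [31] * 10 + [40] * 65 + [31]
samples = [(t0 + timedelta(minutes=i), c) for i, c in enumerate(pattern)]
print(change_intervals(samples))
```

Plotting the `minutes` values from this list over time should reproduce the alternating ~65/~10 minute pattern described above.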

The elasticsearch.yml and jvm.options files are the same on all data nodes.

Before the nodes disconnect, the following error appears in the master node's log file:

[2023-07-12T00:10:12,946][ERROR][o.e.x.m.c.i.IndexStatsCollector] [m01] collector [index-stats] timed out when collecting data: nodes [Hs20tBbARLmfVIwfl_uq6g, aZFlTvfKR3KgoAwa9gHdLA, 4_Lx62u9Qsqwbzwz0A496Q, OJQBrR0URo2R95j7epmyag, CT0jbNdlQsypPonefjrrVw, wDIuVUurTZyfTbd-KZUkaw, TYepUP6qQpWbpXZQax8K5Q, wfUp9qdXQsqysmrQ1Bsl6A, UdKxMlRcTASjVujQr9EM4w] did not respond within [10s]
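The error lists exactly 9 node IDs, the same number as the 9 nodes that disconnect. A small sketch for pulling the IDs out of such a line so they can be matched against node names (for example with `GET _cat/nodes?h=id,name&full_id=true`); the regular expression assumes the message keeps the `nodes [...] did not respond` shape seen here:

```python
import re

LOG_LINE = (
    "[2023-07-12T00:10:12,946][ERROR][o.e.x.m.c.i.IndexStatsCollector] [m01] "
    "collector [index-stats] timed out when collecting data: nodes "
    "[Hs20tBbARLmfVIwfl_uq6g, aZFlTvfKR3KgoAwa9gHdLA, 4_Lx62u9Qsqwbzwz0A496Q, "
    "OJQBrR0URo2R95j7epmyag, CT0jbNdlQsypPonefjrrVw, wDIuVUurTZyfTbd-KZUkaw, "
    "TYepUP6qQpWbpXZQax8K5Q, wfUp9qdXQsqysmrQ1Bsl6A, UdKxMlRcTASjVujQr9EM4w] "
    "did not respond within [10s]"
)


def timed_out_node_ids(line):
    """Extract the node IDs listed in a collector timeout message."""
    match = re.search(r"nodes \[([^\]]*)\] did not respond", line)
    if match is None:
        return []
    return [node_id.strip() for node_id in match.group(1).split(",")]


ids = timed_out_node_ids(LOG_LINE)
print(len(ids))
```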

Config files and stats are here.

Note: In the elasticsearch.yml file, the discovery.seed_hosts setting has 48 entries because I'm planning to add 5 new data nodes, but their installation is not complete yet.

I collect the node count every minute. You can see this data in the number_of_data_nodes_2023.07.log file.

I also have a per-minute list of disconnected nodes in the disconnected_nodes_2023.07.log file.
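A per-minute list of disconnected nodes can be produced by comparing the expected node names with the names currently in the cluster. A minimal sketch, assuming the present names have already been fetched (for example from `GET _cat/nodes?h=name`); `missing_nodes` is a hypothetical helper, and the `present` list below is constructed for illustration:

```python
def missing_nodes(expected, present):
    """Return expected node names absent from the cluster, in expected order."""
    present_set = set(present)
    return [name for name in expected if name not in present_set]


expected = [f"d{i}" for i in range(1, 41)]
problem = ["d11", "d12", "d13", "d14", "d15", "d16", "d17", "d21", "d22"]
# Illustrative snapshot taken while the problem nodes are out of the cluster.
present = [name for name in expected if name not in problem]
print(missing_nodes(expected, present))
```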

Where should I look to fix this problem?

Thanks.

Christian_Dahlqvist · July 12, 2023, 10:01am

Having 5 voting-only master nodes does not make any sense, as you have 3 master-eligible nodes. What is the rationale behind this?

In my opinion, you should never have more than one voting-only master node, as it is designed to act as a tiebreaker, and only when you have an even number of master-eligible nodes.

I would recommend making these normal data nodes and seeing whether it has any effect.

Also, what is the specification of the cluster in terms of hardware and type of storage used? Which version of Elasticsearch are you using?

DavidTurner · July 12, 2023, 10:17am

See these docs:

alissan · July 12, 2023, 1:01pm

Hi Christian,

I read in a document that no more than half of the master nodes should be shut down at the same time, as losing more may cause data loss.
For this reason, I configured 5 data nodes as voting-only master nodes.
If 2 master servers are accidentally shut down, the cluster may fail.

Is this information no longer valid?

master nodes: physical machines
data nodes: virtual machines on 3 different VMware hosts
disks: SSD, connected via Fibre Channel
Elasticsearch version: 8.6.2

Christian_Dahlqvist · July 12, 2023, 1:25pm

If you need to be able to handle the loss of 2 master-eligible nodes at any point in time, you need 5 master-eligible nodes, of which at most one should be voting-only.
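The arithmetic behind this: an election needs a majority (quorum) of the voting configuration, so the number of master-eligible nodes that can fail is the voting size minus the quorum. A sketch for plain master-eligible counts, assuming Elasticsearch's documented behaviour of leaving one node out of the voting configuration when the count is even; it does not model voting-only nodes, which can vote but cannot be elected:

```python
def fault_tolerance(master_eligible):
    """Master-eligible nodes that can be lost while a quorum still survives.

    Assumption: with an even count, one node is left out of the voting
    configuration so that it stays odd-sized.
    """
    voting = master_eligible if master_eligible % 2 == 1 else master_eligible - 1
    quorum = voting // 2 + 1
    return voting - quorum


for n in (3, 4, 5):
    print(n, fault_tolerance(n))
```

With 3 master-eligible nodes only 1 may be lost; surviving the loss of 2 requires 5, and a fourth node adds no extra tolerance.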

Having 5 voting-only master nodes does IMHO not make any sense.

alissan · July 12, 2023, 3:01pm

I changed the 5 voting-only data nodes to data-only nodes, and the problem continues.
I'm now reading the cluster fault detection documentation.
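As a rough orientation for the fault detection docs: the elected master checks each node periodically and removes it after several consecutive failed checks. A back-of-the-envelope sketch of how long that takes, assuming the defaults for `cluster.fault_detection.follower_check.interval` (1s), `.timeout` (10s) and `.retry_count` (3), and assuming each failed check waits the full timeout; it is an estimate, not Elasticsearch's exact scheduling:

```python
def worst_case_detection_seconds(interval_s=1.0, timeout_s=10.0, retry_count=3):
    """Rough upper estimate of time until a silent node is removed.

    Assumes every check runs into the full timeout and checks are spaced
    interval_s apart; removal happens after retry_count consecutive failures.
    """
    return retry_count * (interval_s + timeout_s)


print(worst_case_detection_seconds())
```

Under these assumptions a node that stops responding is removed within about half a minute, far shorter than the ~10 minutes the nodes stay out, so the time taken to rejoin is a separate question.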

system · August 9, 2023, 3:02pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
