Periodic disconnection of same data nodes

Hello,

I have a cluster with 3 master nodes and 40 data nodes (d1, d2, ..., d40).
The first 5 data nodes also have the voting-only master role.

Only the following data nodes show periodic abnormal behavior:

d11,d12,d13,d14,d15,d16,d17,d21,d22

These nodes disconnect from the cluster every ~65 minutes and rejoin after ~10 minutes.

This chart shows the data node count over time (Jul 1-11):
The high values are the expected number of data nodes (40); the low values are the count after the nodes disconnect (31).

This chart shows the minutes between node count changes (Jul 1-11):
High values (~65 mins) are how long the nodes stay connected; low values (~11 mins) are how long they stay disconnected.
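
For anyone who wants to reproduce the second chart from per-minute samples, here is a minimal sketch. The input format (a timestamp string plus a node count) and the function name `change_intervals` are assumptions for illustration; the real number_of_data_nodes_2023.07.log file may be laid out differently. The sample values are made up to mirror the ~65 / ~11 minute pattern described above.

```python
from datetime import datetime

def change_intervals(samples):
    """Return (timestamp, new_count, minutes_since_previous_change) per change.

    `samples` is an iterable of (timestamp_text, node_count) pairs in time
    order. The "%Y-%m-%d %H:%M" timestamp format is an assumption.
    """
    changes = []
    prev_count = None
    prev_change_time = None
    for ts_text, count in samples:
        ts = datetime.strptime(ts_text, "%Y-%m-%d %H:%M")
        if prev_count is None:
            # First sample only establishes the baseline.
            prev_count, prev_change_time = count, ts
            continue
        if count != prev_count:
            minutes = (ts - prev_change_time).total_seconds() / 60
            changes.append((ts_text, count, minutes))
            prev_count, prev_change_time = count, ts
    return changes

# Illustrative samples only, not taken from the real log file.
samples = [
    ("2023-07-01 00:00", 40),
    ("2023-07-01 00:01", 40),
    ("2023-07-01 01:05", 31),
    ("2023-07-01 01:06", 31),
    ("2023-07-01 01:16", 40),
]

for ts_text, count, minutes in change_intervals(samples):
    print(ts_text, count, minutes)
```

A drop to 31 preceded by a long interval is the "connected" time, and a return to 40 preceded by a short interval is the "disconnected" time.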

The elasticsearch.yml and jvm.options files are the same on all data nodes.

Before the nodes disconnect, the following error appears in the master node log file:

[2023-07-12T00:10:12,946][ERROR][o.e.x.m.c.i.IndexStatsCollector] [m01] collector [index-stats] timed out when collecting data: nodes [Hs20tBbARLmfVIwfl_uq6g, aZFlTvfKR3KgoAwa9gHdLA, 4_Lx62u9Qsqwbzwz0A496Q, OJQBrR0URo2R95j7epmyag, CT0jbNdlQsypPonefjrrVw, wDIuVUurTZyfTbd-KZUkaw, TYepUP6qQpWbpXZQax8K5Q, wfUp9qdXQsqysmrQ1Bsl6A, UdKxMlRcTASjVujQr9EM4w] did not respond within [10s]
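
The log line lists 9 node IDs, the same number as the affected data nodes. To confirm they are the same nodes, the IDs can be pulled out of the line and compared with the node names. A minimal sketch follows; the function name `timed_out_node_ids` is hypothetical, the regular expression is based only on the single log line shown above, and the example uses a shortened copy of that line. The extracted IDs could then be compared with the output of `GET _cat/nodes?v&h=id,name&full_id=true`.

```python
import re

def timed_out_node_ids(log_line):
    """Extract node IDs from a 'collector ... timed out' log line.

    The pattern is an assumption based on the one log line in this post.
    Returns an empty list if the line does not match.
    """
    match = re.search(r"nodes \[([^\]]*)\] did not respond", log_line)
    if match is None:
        return []
    return [part.strip() for part in match.group(1).split(",") if part.strip()]

# Shortened copy of the log line above (only the first two IDs kept).
line = (
    "[2023-07-12T00:10:12,946][ERROR][o.e.x.m.c.i.IndexStatsCollector] [m01] "
    "collector [index-stats] timed out when collecting data: nodes "
    "[Hs20tBbARLmfVIwfl_uq6g, aZFlTvfKR3KgoAwa9gHdLA] "
    "did not respond within [10s]"
)

for node_id in timed_out_node_ids(line):
    print(node_id)
```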

Config files and stats are here.

Note: In the elasticsearch.yml file, discovery.seed_hosts has 48 items because I'm planning to add 5 new data nodes, but their installation is not complete yet.

I'm collecting node counts every minute. You can see this data in the number_of_data_nodes_2023.07.log file.

I also have the list of disconnected nodes by minute in the disconnected_nodes_2023.07.log file.

Where should I look to fix this problem?

Thanks.

Giving data nodes the voting-only master role does not make any sense, as you already have 3 master-eligible nodes. What is the rationale behind this?

In my opinion you should never have more than one voting-only master node, as it is designed to act as a tiebreaker, and only when you have an even number of master-eligible nodes.

I would recommend making these normal data nodes and seeing if it has any effect.

Also, what is the specification of the cluster in terms of hardware and type of storage used? Which version of Elasticsearch are you using?

See these docs:


Hi Christian,

I read in a document that no more than half of the master nodes should be shut down at the same time, as this may cause data loss.
For this reason, I configured 5 data nodes as voting-only masters.
Otherwise, if 2 master servers are accidentally shut down, the cluster may fail.

Is this information no longer valid?

master nodes: physical machines
data nodes: virtual machines on 3 different vmware hosts
disks: ssd, connected with fiber channel
elastic version: 8.6.2

If you need to be able to handle the loss of 2 master-eligible nodes at any point in time, you need to have 5 master-eligible nodes, out of which at most one should be voting-only.

Having 5 voting-only master nodes does IMHO not make any sense.
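
The arithmetic behind the "5 master-eligible nodes" figure can be sketched as below. This is a simplified majority-quorum calculation, not Elasticsearch's actual implementation: Elasticsearch manages its voting configuration itself and may adjust it as nodes join and leave. The function name `tolerable_failures` is hypothetical.

```python
def tolerable_failures(voting_nodes):
    """How many voting nodes can be lost while a strict majority remains.

    Simplified model: an election needs more than half of the voting nodes.
    """
    if voting_nodes < 1:
        raise ValueError("need at least one voting node")
    quorum = voting_nodes // 2 + 1  # smallest strict majority
    return voting_nodes - quorum

for n in (3, 4, 5):
    print(n, "voting nodes tolerate", tolerable_failures(n), "failure(s)")
```

Under this model 3 voting nodes tolerate the loss of 1, and 5 tolerate the loss of 2. Going from 3 to 4 adds no extra tolerance, which is why odd counts are the usual choice.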

I changed the voting-only data nodes to data-only nodes, and the problem continues.
I'm now checking the cluster fault detection documentation.
