Elasticsearch nodes continually disconnecting/reconnecting, resulting in a very high number of unassigned shards

I'm running an Elasticsearch cluster (7.4) with 3 master nodes, 3 data nodes, and 1 coordinating node.

After running smoothly for months, nodes began to continually disconnect from and reconnect to the cluster. From observation, it seems like one particular master node and two particular data nodes are the culprits, but I'm very lost on how to tackle this problem.

Some logs from the master node that keeps disconnecting:

[2020-08-04T10:35:10,908][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [elk00-master] collector [cluster_stats] timed out when collecting data
[2020-08-04T10:35:12,471][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [elk00-master] failed to execute on node [hF7l7lJqToywM0Ol2X1yMw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [elk05-data][10.254.113.18:9300][cluster:monitor/nodes/stats[n]] request_id [602470] timed out after [15012ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1022) [elasticsearch-7.4.0.jar:7.4.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.4.0.jar:7.4.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-08-04T10:35:12,814][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [elk00-master] failed to execute on node [hF7l7lJqToywM0Ol2X1yMw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [elk05-data][10.254.113.18:9300][cluster:monitor/nodes/stats[n]] request_id [602489] timed out after [15012ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1022) [elasticsearch-7.4.0.jar:7.4.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.4.0.jar:7.4.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-08-04T10:35:27,492][INFO ][o.e.c.s.ClusterApplierService] [elk00-master] added {{elk04-data}{6kbmixx4T5e9DlvaYgZpqA}{bXRbtweXSxqJi4M1TeDdxA}{10.254.113.40}{10.254.113.40:9300}{dil}{ml.machine_memory=33565130752, ml.max_open_jobs=20, xpack.installed=true},}, term: 1527, version: 64772, reason: Publication{term=1527, version=64772}
[2020-08-04T10:35:27,498][WARN ][o.e.c.c.C.CoordinatorPublication] [elk00-master] after [30s] publication of cluster state version [64772] is still waiting for {elk01-master}{A6CKDkBgTEmSRofFg7COjQ}{yxZzjWolTGWGVZpfID_e6A}{10.254.113.45}{10.254.113.45:9300}{lm}{ml.machine_memory=33565290496, ml.max_open_jobs=20, xpack.installed=true} [SENT_PUBLISH_REQUEST], {elk04-data}{6kbmixx4T5e9DlvaYgZpqA}{bXRbtweXSxqJi4M1TeDdxA}{10.254.113.40}{10.254.113.40:9300}{dil}{ml.machine_memory=33565130752, ml.max_open_jobs=20, xpack.installed=true} [SENT_APPLY_COMMIT], {elk05-data}{hF7l7lJqToywM0Ol2X1yMw}{3k5YYeu3S0-4mVJwbuK5yA}{10.254.113.18}{10.254.113.18:9300}{dil}{ml.machine_memory=33565298688, ml.max_open_jobs=20, xpack.installed=true} [SENT_APPLY_COMMIT]

Some logs from a data node that keeps disconnecting:

[2020-08-04T11:09:17,483][INFO ][o.e.c.c.JoinHelper       ] [elk04-data] failed to join {elk00-master}{VPsRjHluQPq32otzybp0kg}{aYewTJWiQgW_OPiBSfvWJA}{10.254.113.51}{10.254.113.51:9300}{lm}{ml.machine_memory=33565298688, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={elk04-data}{6kbmixx4T5e9DlvaYgZpqA}{bXRbtweXSxqJi4M1TeDdxA}{10.254.113.40}{10.254.113.40:9300}{dil}{ml.machine_memory=33565130752, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=1527, lastAcceptedTerm=1522, lastAcceptedVersion=62819, sourceNode={elk04-data}{6kbmixx4T5e9DlvaYgZpqA}{bXRbtweXSxqJi4M1TeDdxA}{10.254.113.40}{10.254.113.40:9300}{dil}{ml.machine_memory=33565130752, xpack.installed=true, ml.max_open_jobs=20}, targetNode={elk00-master}{VPsRjHluQPq32otzybp0kg}{aYewTJWiQgW_OPiBSfvWJA}{10.254.113.51}{10.254.113.51:9300}{lm}{ml.machine_memory=33565298688, ml.max_open_jobs=20, xpack.installed=true}}]}
org.elasticsearch.transport.ReceiveTimeoutTransportException: [elk00-master][10.254.113.51:9300][internal:cluster/coordination/join] request_id [11269] timed out after [59847ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1022) [elasticsearch-7.4.0.jar:7.4.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.4.0.jar:7.4.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-08-04T11:09:27,485][INFO ][o.e.c.c.JoinHelper       ] [elk04-data] last failed join attempt was 10s ago, failed to join {elk00-master}{VPsRjHluQPq32otzybp0kg}{aYewTJWiQgW_OPiBSfvWJA}{10.254.113.51}{10.254.113.51:9300}{lm}{ml.machine_memory=33565298688, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={elk04-data}{6kbmixx4T5e9DlvaYgZpqA}{bXRbtweXSxqJi4M1TeDdxA}{10.254.113.40}{10.254.113.40:9300}{dil}{ml.machine_memory=33565130752, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=1527, lastAcceptedTerm=1522, lastAcceptedVersion=62819, sourceNode={elk04-data}{6kbmixx4T5e9DlvaYgZpqA}{bXRbtweXSxqJi4M1TeDdxA}{10.254.113.40}{10.254.113.40:9300}{dil}{ml.machine_memory=33565130752, xpack.installed=true, ml.max_open_jobs=20}, targetNode={elk00-master}{VPsRjHluQPq32otzybp0kg}{aYewTJWiQgW_OPiBSfvWJA}{10.254.113.51}{10.254.113.51:9300}{lm}{ml.machine_memory=33565298688, ml.max_open_jobs=20, xpack.installed=true}}]}
org.elasticsearch.transport.ReceiveTimeoutTransportException: [elk00-master][10.254.113.51:9300][internal:cluster/coordination/join] request_id [11269] timed out after [59847ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1022) ~[elasticsearch-7.4.0.jar:7.4.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) ~[elasticsearch-7.4.0.jar:7.4.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-08-04T11:09:47,490][WARN ][o.e.c.c.ClusterFormationFailureHelper] [elk04-data] master not discovered yet: have discovered [{elk04-data}{6kbmixx4T5e9DlvaYgZpqA}{bXRbtweXSxqJi4M1TeDdxA}{10.254.113.40}{10.254.113.40:9300}{dil}{ml.machine_memory=33565130752, xpack.installed=true, ml.max_open_jobs=20}, {elk02-master}{zllmwxZVSLqkmm6dRtu0Bw}{IA6y_BnoTuyilEEBlcZnlg}{10.254.113.28}{10.254.113.28:9300}{lm}{ml.machine_memory=33565298688, ml.max_open_jobs=20, xpack.installed=true}, {elk00-master}{VPsRjHluQPq32otzybp0kg}{aYewTJWiQgW_OPiBSfvWJA}{10.254.113.51}{10.254.113.51:9300}{lm}{ml.machine_memory=33565298688, ml.max_open_jobs=20, xpack.installed=true}, {elk01-master}{A6CKDkBgTEmSRofFg7COjQ}{yxZzjWolTGWGVZpfID_e6A}{10.254.113.45}{10.254.113.45:9300}{lm}{ml.machine_memory=33565290496, ml.max_open_jobs=20, xpack.installed=true}]; discovery will continue using [10.254.113.51:9300, 10.254.113.45:9300, 10.254.113.28:9300] from hosts providers and [{elk02-master}{zllmwxZVSLqkmm6dRtu0Bw}{IA6y_BnoTuyilEEBlcZnlg}{10.254.113.28}{10.254.113.28:9300}{lm}{ml.machine_memory=33565298688, ml.max_open_jobs=20, xpack.installed=true}, {elk00-master}{VPsRjHluQPq32otzybp0kg}{aYewTJWiQgW_OPiBSfvWJA}{10.254.113.51}{10.254.113.51:9300}{lm}{ml.machine_memory=33565298688, ml.max_open_jobs=20, xpack.installed=true}, {elk01-master}{A6CKDkBgTEmSRofFg7COjQ}{yxZzjWolTGWGVZpfID_e6A}{10.254.113.45}{10.254.113.45:9300}{lm}{ml.machine_memory=33565290496, ml.max_open_jobs=20, xpack.installed=true}] from last-known cluster state; node term 1527, last-accepted version 65165 in term 1527
[2020-08-04T11:09:51,591][ERROR][o.e.x.m.c.n.NodeStatsCollector] [elk04-data] collector [node_stats] timed out when collecting data

Output of `_cat/health?v&ts=false&pretty`:

cluster  status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
T2-OSS02 red             5         2    158 158    0    6      882            52              49.1m                 15.1%
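For a red cluster like this, it can help to ask Elasticsearch directly why shards are unassigned. A minimal sketch, assuming the API is reachable on `localhost:9200` (substitute your coordinating node's address):

```shell
# Explain the allocation of the first unassigned shard the cluster finds
# (7.x supports calling this API with no request body).
curl -s "localhost:9200/_cluster/allocation/explain?pretty"

# List unassigned shards with their reason codes (e.g. NODE_LEFT, ALLOCATION_FAILED)
curl -s "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" \
  | grep UNASSIGNED
```

The `unassigned.reason` column and the explain output usually distinguish shards waiting on a departed node from shards that have genuinely failed to allocate.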

I believe the cause was a disk-related issue on the RHEL servers these nodes are hosted on, which resulted in some data loss. The hosting team says everything is stable again on their end, but I've tried to restart the cluster with no luck.

Please let me know if you have any input!

Thank you

These are request timeouts.

org.elasticsearch.transport.ReceiveTimeoutTransportException: [elk05-data][10.254.113.18:9300][cluster:monitor/nodes/stats[n]] request_id [602470] timed out after [15012ms]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [elk00-master][10.254.113.51:9300][internal:cluster/coordination/join] request_id [11269] timed out after [59847ms]

I've seen this on some clusters whose disk IO is unable to keep up with the cluster's requirements.
You didn't really provide any specifics on your configuration/usage, so I'd suggest you collect metrics on the servers (CPU usage, RAM usage, JVM heap usage, disk IO, disk wait, disk latency). These will probably point you in a good direction.
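A quick way to collect those metrics, assuming the sysstat package is installed on the hosts and Elasticsearch answers on `localhost:9200`:

```shell
# Host-level: per-device IO utilization, wait times, and CPU usage
iostat -x 5 3    # 3 samples, 5s apart; watch %util and await
sar -u 5 3       # CPU usage over the same window
free -h          # memory breakdown (used vs. cache)

# Cluster-level: heap, RAM, CPU, and load per node from Elasticsearch itself
curl -s "localhost:9200/_cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m"
```

Sustained high `await` or `%util` near 100% in `iostat` would point at the disk IO problem suspected here.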

Also, on a slightly unrelated note, you may want to consider dropping the coordinating node. Your cluster isn't that large and doesn't really require one (and having only one makes it a single point of failure). Your data nodes should be able to handle that function just fine.


Also, what is the heap and CPU situation on the master? It's common to lose nodes on master timeouts when the master is overloaded. Your index/shard count isn't too high, but I'm hoping your master isn't running on 1GB of heap or less (though I see 32GB of RAM here).

A failure to join indicates a serious issue between that node and the master: either the network is messed up, or the master is very slow for some reason (CPU, RAM, or disk). If the node is master-only, it's hard for disk to be too slow unless it's simply freezing, so RAM/heap pressure or the network is much more likely.

Do cluster commands directed at the master, like health, stats, etc., return quickly and consistently? (Everything except state should return within a second or two.)
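A simple way to check this, assuming the master answers on `localhost:9200`:

```shell
# Time a few master-directed calls; each should complete within a second or two.
for endpoint in _cluster/health _cluster/stats _cat/nodes; do
  echo "== $endpoint =="
  time curl -s -o /dev/null "localhost:9200/$endpoint"
done
```

If `_cluster/health` is fast but `_cluster/stats` or `_cat/nodes` stalls, the slowness is likely on whichever data node is failing to answer the fan-out, not on the master itself.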

What led you to believe it was a RHEL disk issue, or data loss?

This isn't true; by a wide margin the most common cause of these sorts of messages is slow IO. In 7.4 it can take quite a lot of fsync()s to write the cluster state to disk, sometimes hundreds or even thousands, and this can take over a minute on poor disks, resulting in this kind of timeout. This is streamlined in more recent versions, so the simplest fix may be to upgrade.


After checking node health using the API, it looks like almost all the data nodes are at 99% RAM consumption.

I have yet to install sysstat to monitor things at a deeper level, but I'll come back to this thread if I find something noteworthy.

To your point about the coordinating node: at what point would you say it's worth having one? We have Grafana hosted on the coordinating node as well, so I figured it'd be easier to point all queries at a single machine. Would you suggest turning the coordinating node into an additional data node?

All nodes have 32GB RAM and 16 cores, and the cluster hadn't had any issues until now.

Yeah, commands directed at the masters return quickly. The one exception is `_cat/nodes?v&pretty`, which can take up to 20 seconds to return.

The cloud provider let me know that they had some faulty hard drives that were slowing down the environments and affecting latency, which caused the initial outage of my cluster. After I was notified it was fixed, I tried to restart my cluster but couldn't get it to stick: the number of unassigned shards kept fluctuating, and it was clear the nodes were struggling to recover.

Appreciate the insight

Given the terrible state of the cluster and the high number of unavailable shards, would I still be able to snapshot and restore the original state of the data? It's important that I'm able to save it, as it dates back several months.

Given that each node has 32GB RAM, and we're dealing with ~120GB of data that is floating around on unassigned shards, does it make sense that the cluster is having trouble resolving this workload under normal circumstances?

I ask because there could still be an underlying issue with the hard drives at our cloud provider, which would affect the IO operations on our nodes. In that case I would rather let them handle it on their end than roll out an upgrade if I don't need to.

A few things:

it looks like almost all the data nodes are at 99% RAM consumption

What do you mean by the above? It's fine if 16GB is used for heap and 16GB for the filesystem cache, or less than 16GB for heap and more than 16GB for the filesystem cache. It can be a problem if your heap is over 16GB; if so, you may want to reduce your heap to no more than 16GB. Also, if your server is swapping, try disabling swap, as it can cause issues.
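To verify which situation you're in, assuming Elasticsearch answers on `localhost:9200`:

```shell
# Compare configured heap against total RAM per node; heap.max should be
# no more than about half of ram.max.
curl -s "localhost:9200/_cat/nodes?v&h=name,heap.max,heap.percent,ram.max,ram.percent"

# Disable swap on the host immediately (also remove swap entries from
# /etc/fstab so this persists across reboots).
sudo swapoff -a
```

Setting `bootstrap.memory_lock: true` in `elasticsearch.yml` is the usual belt-and-braces companion to disabling swap, as it locks the JVM heap into RAM.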

If you can take the cluster offline, I'd recommend bringing all nodes down, then bringing up your master nodes so they gain quorum. Then activate one data node at a time and let the shards recover. You can follow the steps here (skipping the actual upgrade parts): https://www.elastic.co/guide/en/elasticsearch/reference/current/restart-upgrade.html
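The allocation-related steps from that guide can be sketched as follows (assuming `localhost:9200` reaches the cluster; run the first two commands before shutting anything down):

```shell
# 1. Stop the cluster from reallocating shards while nodes are down
curl -s -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' -d'
{ "persistent": { "cluster.routing.allocation.enable": "primaries" } }'

# 2. Flush so recovery after restart is faster
#    (on 7.4 the docs still recommended a synced flush: POST /_flush/synced)
curl -s -X POST "localhost:9200/_flush"

# ... stop all nodes, start the masters, then start data nodes one at a time ...

# 3. Once nodes have rejoined, re-enable allocation (null resets to default)
curl -s -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' -d'
{ "persistent": { "cluster.routing.allocation.enable": null } }'
```

Disabling allocation first prevents the cluster from futilely trying to rebuild replicas on the remaining nodes while you restart the others.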

If you're using Grafana, the coordinating node might be fine, as they're both single points of failure in your setup. You don't really get any performance advantage from the coordinating node since your cluster is so small, but since Grafana only accepts one URL, it does add some convenience.

No, any red shards will not be included in a snapshot.
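That said, a partial snapshot can still capture everything that is currently available. A hedged sketch, where the repository name `my_backup` and the snapshot name are placeholders for whatever you have registered:

```shell
# By default a snapshot fails if any primary shard is unavailable;
# "partial": true lets it proceed, skipping the red shards.
curl -s -X PUT "localhost:9200/_snapshot/my_backup/partial-snap-1" \
  -H 'Content-Type: application/json' -d'
{ "partial": true }'

# Check which shards made it into the snapshot
curl -s "localhost:9200/_snapshot/my_backup/partial-snap-1?pretty"
```

The red shards are still lost, but the green and yellow indices are preserved before any further restart attempts.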

Nodes failing to join the cluster is never normal, but it's pretty tricky to say what's going wrong in 7.4. As well as better performance, newer versions also have better logging and diagnostics to help in this kind of situation.

The nodes each have 16GB allocated for heap usage

I was just told that our cloud provider is still having some disk-related issues. I'm hoping this was the underlying cause of the nodes timing out and disconnecting.

Once service has been restored, I'll try to bring the cluster back up following the methodology you laid out.

Appreciate your help on this

It appears the disk I/O degradation was the culprit.

After the cloud provider fixed their hard drive issue and we rebooted the instances hosting the nodes, the cluster began to run normally.

Thank you!!

Really? How often does the master sync to disk, especially in a stable cluster where shards presumably aren't moving often?

What is updated the most? I'd assume segment merges/commits are the most frequently updated data, since new indexes or shard movements would be rare, like once per hour or per day.

I assumed (I guess wrongly) the master was syncing the whole state from time to time, that the data was relatively limited in size, and that it could simultaneously answer API calls from its copy in RAM.

In a stable cluster, not so much. But that's not relevant here; we're talking about a node joining the cluster, which (amongst other things) requires that node to write the complete cluster state to disk.

The (persisted) cluster state only contains metadata; the master plays no part in merges or commits as these are handled locally on the data node. It would be a dreadful bottleneck otherwise.

Shard movement does involve the master at a very high level, but IIRC the only cluster state update that involves writing anything to disk during a shard movement is when it completes and the new shard is marked as in-sync, and that only requires writing the metadata for that one index.

There's no time-based component, the state is written to disk as needed, but no more frequently than that. The writing is incremental so if not much changes then not much is written. However when joining a cluster essentially the whole state has changed (or at least we can't tell what hasn't changed so we must be conservative) so it must all be written out.


Makes sense. So the segment list you can get via the API must come from all the nodes in real time, then, not from the cluster state.

But if there are no segment updates, how can the master get so bound up in IO and fsync()s that it times out in a 3-node cluster that is presumably stable? How can it be writing/syncing more than once a second, or even once a minute?

I guess I'm lost on the "hundreds or thousands" of fsync()s that can happen, other than on nodes leaving/joining.

Yep, the segments API fans out to all data nodes.

The cluster in the OP isn't stable: it has a node continually joining and leaving, and the IO in question isn't happening on the master, it's happening on the affected node, where it's too slow.

I'm lost too: why are you asking about the case where nodes aren't leaving or joining? Nodes leaving and joining is exactly the case we're discussing here. In a stable cluster this should be much less of a concern.


Sorry; it does seem slow/bad disks were the likely trigger even in normal operations, and then things got so slow that the node's requests timed out and it dropped out.

My general assumption was that the cluster starts out stable, then gets in trouble with nodes disconnecting due to timeouts, per "After running smoothly for months, nodes began to continually disconnect/reconnect from the cluster." Absent disk issues, that was confusing, since little should be going on that would cause them to drop out. I imagine the disks froze, or had such insane latency that even tiny operations timed out.

Anyway, interesting.

Ah, I see. Maybe, or maybe it initially dropped out for some other reason (e.g. a network blip).


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.