Troubleshooting an unstable cluster

Hi Team,

We have deployed an Elasticsearch cluster across 9 physical machines, each hosting multiple instances. The hardware specifications per machine are:

  • 112 CPU cores
  • 503 GB memory

Here's the cluster and node info.
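
These are the standard root info, cluster health, and cat nodes APIs; roughly the following, with localhost:9200 standing in for whichever node and HTTP port was queried:

curl -s 'localhost:9200/?pretty'
curl -s 'localhost:9200/_cluster/health?pretty'
curl -s 'localhost:9200/_cat/nodes?v'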

{
  "name" : "node86",
  "cluster_name" : "es-prod",
  "cluster_uuid" : "2L75uEv7RjCEvNwViB5L_w",
  "version" : {
    "number" : "7.7.1",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "ad56dce891c901a492bb1ee393f12dfff473a423",
    "build_date" : "2020-05-28T16:30:01.040088Z",
    "build_snapshot" : false,
    "lucene_version" : "8.5.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

{
  "cluster_name" : "es-prod",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 57,
  "number_of_data_nodes" : 54,
  "active_primary_shards" : 3042,
  "active_shards" : 6164,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
name      heap.percent ram.percent cpu load_1m load_5m load_15m node.role master
node89-1            52          99  17   23.68   23.44    23.60 dilrt     -
node86-6            59          99  17   23.46   22.35    23.93 dilrt     -
node88-5            33          99  18   21.38   22.55    22.36 dilrt     -
node136-2           34          99   5    6.33    5.76     5.93 dilrt     -
node136-3           16          99   5    6.33    5.76     5.93 dilrt     -
node88-1            64          99  18   21.38   22.55    22.36 dilrt     -
node87              30          99  15   22.16   25.03    24.95 ilmr      *
node87-6            36          99  15   22.16   25.03    24.95 dilrt     -
node228-3           38          98  16   25.54   21.54    21.95 dilrt     -
node89-3            58          99  17   23.68   23.44    23.60 dilrt     -
node90-4            70          99  14   20.88   19.48    20.62 dilrt     -
node90-5            41          99  14   20.88   19.48    20.62 dilrt     -
node87-1            54          99  15   22.16   25.03    24.95 dilrt     -
node218-4           45          98  17   27.75   23.19    24.56 dilrt     -
node228-6           41          98  17   25.54   21.54    21.95 dilrt     -
node218-2           18          98  19   27.75   23.19    24.56 dilrt     -
node90-3            63          99  14   20.88   19.48    20.62 dilrt     -
node136-6           64          99   5    6.33    5.76     5.93 dilrt     -
node228-1           37          98  17   25.54   21.54    21.95 dilrt     -
node227-2           43          99  22   24.07   23.45    23.81 dilrt     -
node227-3           26          99  21   24.07   23.45    23.81 dilrt     -
node90-6            50          99  14   20.88   19.48    20.62 dilrt     -
node88-6            57          99  18   21.38   22.55    22.36 dilrt     -
node136-1           42          99   5    6.33    5.76     5.93 dilrt     -
node136-4           50          99   5    6.33    5.76     5.93 dilrt     -
node87-2            61          99  14   22.16   25.03    24.95 dilrt     -
node86-3            69          99  17   23.46   22.35    23.93 dilrt     -
node90-2            46          99  14   20.88   19.48    20.62 dilrt     -
node228-5           23          98  17   25.54   21.54    21.95 dilrt     -
node88-2            60          99  18   21.38   22.55    22.36 dilrt     -
node87-5            55          99  15   22.16   25.03    24.95 dilrt     -
node86-4            58          99  17   23.46   22.35    23.93 dilrt     -
node86-1            37          99  17   23.46   22.35    23.93 dilrt     -
node218-5           63          98  19   27.75   23.19    24.56 dilrt     -
node90-1            64          99  14   20.88   19.48    20.62 dilrt     -
node228-4           56          98  17   25.54   21.54    21.95 dilrt     -
node218-1           63          98  19   27.75   23.19    24.56 dilrt     -
node136-5           45          99   4    6.33    5.76     5.93 dilrt     -
node87-4            52          99  15   22.16   25.03    24.95 dilrt     -
node218-6           28          98  19   27.75   23.19    24.56 dilrt     -
node228-2           53          98  17   25.54   21.54    21.95 dilrt     -
node86              46          99  17   23.46   22.35    23.93 ilmr      -
node89-4            66          99  16   23.68   23.44    23.60 dilrt     -
node227-5           70          99  21   24.07   23.45    23.81 dilrt     -
node89-2            35          99  17   23.68   23.44    23.60 dilrt     -
node227-6           66          99  21   24.07   23.45    23.81 dilrt     -
node88-4            49          99  14   21.38   22.55    22.36 dilrt     -
node88-3            38          99  18   21.38   22.55    22.36 dilrt     -
node227-4           46          99  21   24.07   23.45    23.81 dilrt     -
node88              28          99  14   21.38   22.55    22.36 ilmr      -
node89-5            67          99  17   23.68   23.44    23.60 dilrt     -
node87-3            64          99  16   22.16   25.03    24.95 dilrt     -
node86-5            62          99  17   23.46   22.35    23.93 dilrt     -
node89-6            61          99  17   23.68   23.44    23.60 dilrt     -
node227-1           24          99  18   24.07   23.45    23.81 dilrt     -
node218-3           71          98  17   27.75   23.19    24.56 dilrt     -
node86-2            32          99  18   23.46   22.35    23.93 dilrt     -

Recently, the cluster has been experiencing random, short periods of unresponsiveness, causing significant delays in API calls.

During these incidents, we observed the following trace logs:

[2025-04-12T11:02:48,310][TRACE][o.e.t.T.tracer           ] [node87] [43727295977][cluster:monitor/nodes/stats[n]] sent to [{node218-5}{2zc4xF7ITjSTTN3ODKt-LA}{CoxjuxgBSXqZFAYPM-TOow}{10.25.204.218}{10.25.204.218:9305}{dilrt}{rack_id=node218, ml.machine_memory=540497911808, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}] (timeout: [null])
[2025-04-12T11:02:48,310][TRACE][o.e.t.T.tracer           ] [node218-5] [43727295977][cluster:monitor/nodes/stats[n]] received request
[2025-04-12T11:03:04,592][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node218-5] collector [node_stats] timed out when collecting data
[2025-04-12T11:03:06,191][TRACE][o.e.t.T.tracer           ] [node218-5] [43727295977][cluster:monitor/nodes/stats[n]] sent response
[2025-04-12T16:02:48,872][TRACE][o.e.t.T.tracer           ] [node87] [44003630793][cluster:monitor/nodes/stats[n]] sent to [{node218-5}{2zc4xF7ITjSTTN3ODKt-LA}{CoxjuxgBSXqZFAYPM-TOow}{10.25.204.218}{10.25.204.218:9305}{dilrt}{rack_id=node218, ml.machine_memory=540497911808, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}] (timeout: [15s])
[2025-04-12T16:03:05,342][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node218-5] collector [node_stats] timed out when collecting data
[2025-04-12T16:03:06,007][TRACE][o.e.t.T.tracer           ] [node218-5] [44003630793][cluster:monitor/nodes/stats[n]] received request
[2025-04-12T16:03:06,133][TRACE][o.e.t.T.tracer           ] [node218-5] [44003630793][cluster:monitor/nodes/stats[n]] sent response
[2025-04-12T16:03:06,133][WARN ][o.e.t.TransportService   ] [node87] Received response for a request that has timed out, sent [17408ms] ago, timed out [2401ms] ago, action [cluster:monitor/nodes/stats[n]], node [{node218-5}{2zc4xF7ITjSTTN3ODKt-LA}{CoxjuxgBSXqZFAYPM-TOow}{10.25.204.218}{10.25.204.218:9305}{dilrt}{rack_id=node218, ml.machine_memory=540497911808, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}], id [44003630793]


Could you help investigate the root cause of these intermittent failures? Let us know if additional logs or metrics are needed.

Thank you!

This version is very old and has been EOL for a long time. You should really look to upgrade.

What type of storage are you using? This is often the limiting factor and a common source of problems in Elasticsearch clusters under high load. What does disk I/O and await look like on the nodes?
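
For example, something like this on each physical host shows per-device utilisation and await; it assumes the sysstat package is installed:

# extended per-device statistics, refreshed every 5 seconds;
# watch %util and the await (or r_await/w_await) columns
iostat -x 5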

How are you separating nodes on the hosts? Containers?

No, each instance just starts on a different port and a different disk, nothing else.
Like this:

...
node.name: node87-1
node.master: false
node.data: true
node.attr.rack_id: node87
cluster.routing.allocation.same_shard.host: true
path.data: /mnt/data01/node87-1/data/
path.logs: /mnt/data01/node87-1/logs/
bootstrap.memory_lock: true
http.port: 9201
transport.tcp.port: 9301

OK. That means all the nodes share the same operating system page cache and can interfere with each other.
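
That also helps explain the ram.percent of 98-99 across the board in the _cat/nodes output: it effectively counts the shared page cache as used memory. On the hosts, something like this separates reclaimable cache from memory that is actually in use (assuming a reasonably recent procps):

# "buff/cache" is largely the shared page cache; "available" is what can still be handed out
free -h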

What type of storage are you using? What does disk I/O and await look like?

How many independent IO data paths are there, per server?

You have 6 data node instances per server, right? So if the answer to the previous question is less than 6, then some IO paths are servicing multiple ES instances.
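
One way to check, sketched against the path layout in the config above (adjust the globs if the mount points differ per host, and note --output is GNU df):

# map each instance's data path to the filesystem that backs it;
# a device appearing more than once means instances share an IO path
for d in /mnt/data*/node*/data; do
  printf '%s -> ' "$d"; df --output=source "$d" | tail -1
done | sort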

How are your shards arranged, per β€œactive” index?
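
For one of the busy indices, something like this shows how its shards are spread across instances and hosts (my-index is a placeholder):

curl -s 'localhost:9200/_cat/shards/my-index?v&h=index,shard,prirep,store,node&s=node'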

I’m presuming it all worked great for a while? Is that right? Did anything else change recently? Like an increase in data volume, search volume, etc.?

Out of curiosity, were there better options? Given what they apparently wanted, 6 data nodes per server, plus a few masters?

What type of storage are you using?

We use SSD.

What does disk I/O and await look like?

[Disk I/O graphs covering 2025-04-12T15:30:00 to 2025-04-12T16:30:00]


Is it a coincidence that the 2 logs shared are both from around 2-3 minutes past the hour?

11:02:48
16:02:48

Any cron jobs or similar running that might have an impact?
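
For example, on each host something like this lists scheduled jobs that could line up with those timestamps; it assumes standard cron and systemd locations:

# system-wide and per-user cron entries (run as root to see all users)
cat /etc/crontab
ls /etc/cron.d /etc/cron.hourly
for u in $(cut -d: -f1 /etc/passwd); do crontab -l -u "$u" 2>/dev/null; done

# systemd timers, if in use
systemctl list-timers --all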

We have 6 independent IO data paths per server.

Most of them are arranged by size, and almost all shards are under 50 GB.

Yeah, most of the time it works quickly and fine. But sometimes a node fails its health check, and has even left the cluster and automatically re-joined after a while.

Nothing special.

Sorry, I need to check this.

The main thing is to check for correlations, including in time. It can be other things too: snapshots, backups, scans, hosts, …
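
On the Elasticsearch side, a couple of quick checks for snapshot activity around those times (endpoints as they exist in 7.x; localhost:9200 again stands in for any node's HTTP port):

# registered snapshot repositories
curl -s 'localhost:9200/_cat/repositories?v'

# snapshots currently running, if any
curl -s 'localhost:9200/_snapshot/_status?pretty'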

Based on what you shared, for now I'd think it's as likely to be something outside Elasticsearch as within it.

Drive-by help... your master nodes are marked as ingest, machine learning, etc. In a cluster like yours they probably should not be; they should probably be master-only.

FWIW I've seen cases where TRACE logging itself can cause instability in a production cluster. The reference manual recommends not doing this:

Do not enable logging at levels DEBUG or TRACE unless you are following instructions elsewhere in this manual which call for more detailed logging, or you are an expert user who will be reading the Elasticsearch source code to determine the meaning of the logs.
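
If the tracer was switched on dynamically, it can be switched back off by resetting the logger via the cluster settings API, along these lines (assuming that is how it was enabled; adjust if you changed log4j2.properties instead):

curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d'
{
  "transient":  { "logger.org.elasticsearch.transport.TransportService.tracer": null },
  "persistent": { "logger.org.elasticsearch.transport.TransportService.tracer": null }
}'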

But yeah also 7.7 is ancient and definitely we've fixed bugs with the symptoms you describe in the 5 years since it was released. You need to upgrade to a supported version, ideally the newest available, as soon as possible.

@lwm2016

Did you enable the trace logs before you had the stability issues?

Also, the 2 log sections you posted originally, as well as being almost exactly 5 hours apart (within about half a second!), mention exactly the same 2 nodes: node87, one of your masters, and node218-5, one of your many data nodes. With so many data nodes, and given my general distrust of coincidence ....

Thank you for your suggestion, we will optimize it.

We enabled the trace logs after that.

That's great! Is there any related issue link? Sorry, I can't find it. Or does version 7.17 fix this issue?

There have been many changes in those 5 years that could have addressed this, and no, I'm not going to dig them out. It's possible 7.17 will be better, but even that series is rather old now; you need to upgrade to 8.x ASAP.

From what you have written, I understand you had no issues with the same 7.7.1 version / same config / same hardware you are using now, likely for some time. Then you started to see some issues, you enabled the trace logging, you saw the logs you posted, and you asked the forum for help.

Updating to either the latest 7.x or to 8.x is a decision you need to make. You know your environment, the rationale behind the choices you have made, and "how stuff works", in the widest sense, within your organization / environment.

Outside what I've pointed out around around the curiosities in the logs you shared, and if you share more logs and more details of your environment we might get better insights, my own hunch (just a hunch) is your issue is "outside" elasticsearch. Some subtle hardware issue, something else running that disturbs things, whatever. But I can be wrong obviously. I'd be curious if you found anything that correlates to the timing and node-affinity of the 2 errors you shared.

Similarly, David suggests you could update to a newer release, which DOES contain a lot of fixes you don't have. Would it fix your current issue? Neither he nor I can know for sure. What is almost certain is that if you do nothing, your problems will either continue as they are now, or maybe get worse.

BTW, I also notice in the output above that none of the data node instances node136-1/2/3/4/5/6 are anywhere near as loaded as the rest of the data nodes, whereas all the others (node86-x / node87-x / node88-x / node89-x / node90-x / node218-x / node227-x / node228-x) are significantly more loaded. I have absolutely no idea why that is, or what's different about node136, but maybe you do? Something with how the most heavily loaded shards are distributed? Just another curiosity.
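
If you want to chase that down, something like this shows shard counts and disk used per instance, which would confirm whether the node136 instances simply hold fewer or lighter shards:

curl -s 'localhost:9200/_cat/allocation?v&s=shards:desc'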

In the last 5 years we've fixed loads of bugs in ES that would explain these symptoms in sufficiently large clusters. It's not impossible that the problem is elsewhere, of course, but the chances are it's one of these bugs. At least, I wouldn't start looking elsewhere until ruling them out (by upgrading). Recent versions have way better tooling for investigating this kind of issue too, so even if the upgrade doesn't fix it, it'll give you what you need to open a bug report.

David knows ES wayyyy better than I do, and the details of all the fixed bugs too. But a 57-instance production cluster upgrade is not a minor effort. I still think it is worthwhile to explore other avenues, in parallel with preparing the upgrade (which is a good idea by itself, noting that 9.x will be with us soon and @lwm2016 would therefore be 2 major releases behind).

A priori I would agree with David, but the extra information that's been shared shows error logs exactly 5 hours apart (within half a second) on the same 2 instances (OK, node87 is a master, but node218-5 is 1 of 54 data instances); this is fairly suggestive IMO. But indeed it is only 2 log footprints.

Let's see if @lwm2016 can figure it out (I hope so), or shares some more findings / diagnostics.