Troubleshooting an unstable cluster

Hi Team,

We have deployed an Elasticsearch cluster across 9 physical machines, each hosting multiple instances. The hardware specifications per machine are:

  • 112 CPU cores
  • 503 GB memory

Here's the cluster and node info.
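
These are the standard root info, cluster health, and cat nodes APIs; roughly the following, with localhost:9200 standing in for whichever node and HTTP port was queried:

curl -s 'localhost:9200/?pretty'
curl -s 'localhost:9200/_cluster/health?pretty'
curl -s 'localhost:9200/_cat/nodes?v'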

{
  "name" : "node86",
  "cluster_name" : "es-prod",
  "cluster_uuid" : "2L75uEv7RjCEvNwViB5L_w",
  "version" : {
    "number" : "7.7.1",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "ad56dce891c901a492bb1ee393f12dfff473a423",
    "build_date" : "2020-05-28T16:30:01.040088Z",
    "build_snapshot" : false,
    "lucene_version" : "8.5.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

{
  "cluster_name" : "es-prod",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 57,
  "number_of_data_nodes" : 54,
  "active_primary_shards" : 3042,
  "active_shards" : 6164,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
name      heap.percent ram.percent cpu load_1m load_5m load_15m node.role master
node89-1            52          99  17   23.68   23.44    23.60 dilrt     -
node86-6            59          99  17   23.46   22.35    23.93 dilrt     -
node88-5            33          99  18   21.38   22.55    22.36 dilrt     -
node136-2           34          99   5    6.33    5.76     5.93 dilrt     -
node136-3           16          99   5    6.33    5.76     5.93 dilrt     -
node88-1            64          99  18   21.38   22.55    22.36 dilrt     -
node87              30          99  15   22.16   25.03    24.95 ilmr      *
node87-6            36          99  15   22.16   25.03    24.95 dilrt     -
node228-3           38          98  16   25.54   21.54    21.95 dilrt     -
node89-3            58          99  17   23.68   23.44    23.60 dilrt     -
node90-4            70          99  14   20.88   19.48    20.62 dilrt     -
node90-5            41          99  14   20.88   19.48    20.62 dilrt     -
node87-1            54          99  15   22.16   25.03    24.95 dilrt     -
node218-4           45          98  17   27.75   23.19    24.56 dilrt     -
node228-6           41          98  17   25.54   21.54    21.95 dilrt     -
node218-2           18          98  19   27.75   23.19    24.56 dilrt     -
node90-3            63          99  14   20.88   19.48    20.62 dilrt     -
node136-6           64          99   5    6.33    5.76     5.93 dilrt     -
node228-1           37          98  17   25.54   21.54    21.95 dilrt     -
node227-2           43          99  22   24.07   23.45    23.81 dilrt     -
node227-3           26          99  21   24.07   23.45    23.81 dilrt     -
node90-6            50          99  14   20.88   19.48    20.62 dilrt     -
node88-6            57          99  18   21.38   22.55    22.36 dilrt     -
node136-1           42          99   5    6.33    5.76     5.93 dilrt     -
node136-4           50          99   5    6.33    5.76     5.93 dilrt     -
node87-2            61          99  14   22.16   25.03    24.95 dilrt     -
node86-3            69          99  17   23.46   22.35    23.93 dilrt     -
node90-2            46          99  14   20.88   19.48    20.62 dilrt     -
node228-5           23          98  17   25.54   21.54    21.95 dilrt     -
node88-2            60          99  18   21.38   22.55    22.36 dilrt     -
node87-5            55          99  15   22.16   25.03    24.95 dilrt     -
node86-4            58          99  17   23.46   22.35    23.93 dilrt     -
node86-1            37          99  17   23.46   22.35    23.93 dilrt     -
node218-5           63          98  19   27.75   23.19    24.56 dilrt     -
node90-1            64          99  14   20.88   19.48    20.62 dilrt     -
node228-4           56          98  17   25.54   21.54    21.95 dilrt     -
node218-1           63          98  19   27.75   23.19    24.56 dilrt     -
node136-5           45          99   4    6.33    5.76     5.93 dilrt     -
node87-4            52          99  15   22.16   25.03    24.95 dilrt     -
node218-6           28          98  19   27.75   23.19    24.56 dilrt     -
node228-2           53          98  17   25.54   21.54    21.95 dilrt     -
node86              46          99  17   23.46   22.35    23.93 ilmr      -
node89-4            66          99  16   23.68   23.44    23.60 dilrt     -
node227-5           70          99  21   24.07   23.45    23.81 dilrt     -
node89-2            35          99  17   23.68   23.44    23.60 dilrt     -
node227-6           66          99  21   24.07   23.45    23.81 dilrt     -
node88-4            49          99  14   21.38   22.55    22.36 dilrt     -
node88-3            38          99  18   21.38   22.55    22.36 dilrt     -
node227-4           46          99  21   24.07   23.45    23.81 dilrt     -
node88              28          99  14   21.38   22.55    22.36 ilmr      -
node89-5            67          99  17   23.68   23.44    23.60 dilrt     -
node87-3            64          99  16   22.16   25.03    24.95 dilrt     -
node86-5            62          99  17   23.46   22.35    23.93 dilrt     -
node89-6            61          99  17   23.68   23.44    23.60 dilrt     -
node227-1           24          99  18   24.07   23.45    23.81 dilrt     -
node218-3           71          98  17   27.75   23.19    24.56 dilrt     -
node86-2            32          99  18   23.46   22.35    23.93 dilrt     -

Recently, the cluster has been experiencing random, short periods of unresponsiveness, causing significant delays in API calls.

During these incidents, we observed the following trace logs:

[2025-04-12T11:02:48,310][TRACE][o.e.t.T.tracer           ] [node87] [43727295977][cluster:monitor/nodes/stats[n]] sent to [{node218-5}{2zc4xF7ITjSTTN3ODKt-LA}{CoxjuxgBSXqZFAYPM-TOow}{10.25.204.218}{10.25.204.218:9305}{dilrt}{rack_id=node218, ml.machine_memory=540497911808, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}] (timeout: [null])
[2025-04-12T11:02:48,310][TRACE][o.e.t.T.tracer           ] [node218-5] [43727295977][cluster:monitor/nodes/stats[n]] received request
[2025-04-12T11:03:04,592][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node218-5] collector [node_stats] timed out when collecting data
[2025-04-12T11:03:06,191][TRACE][o.e.t.T.tracer           ] [node218-5] [43727295977][cluster:monitor/nodes/stats[n]] sent response
[2025-04-12T16:02:48,872][TRACE][o.e.t.T.tracer           ] [node87] [44003630793][cluster:monitor/nodes/stats[n]] sent to [{node218-5}{2zc4xF7ITjSTTN3ODKt-LA}{CoxjuxgBSXqZFAYPM-TOow}{10.25.204.218}{10.25.204.218:9305}{dilrt}{rack_id=node218, ml.machine_memory=540497911808, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}] (timeout: [15s])
[2025-04-12T16:03:05,342][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node218-5] collector [node_stats] timed out when collecting data
[2025-04-12T16:03:06,007][TRACE][o.e.t.T.tracer           ] [node218-5] [44003630793][cluster:monitor/nodes/stats[n]] received request
[2025-04-12T16:03:06,133][TRACE][o.e.t.T.tracer           ] [node218-5] [44003630793][cluster:monitor/nodes/stats[n]] sent response
[2025-04-12T16:03:06,133][WARN ][o.e.t.TransportService   ] [node87] Received response for a request that has timed out, sent [17408ms] ago, timed out [2401ms] ago, action [cluster:monitor/nodes/stats[n]], node [{node218-5}{2zc4xF7ITjSTTN3ODKt-LA}{CoxjuxgBSXqZFAYPM-TOow}{10.25.204.218}{10.25.204.218:9305}{dilrt}{rack_id=node218, ml.machine_memory=540497911808, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}], id [44003630793]


Could you help investigate the root cause of these intermittent failures? Let us know if additional logs or metrics are needed.

Thank you!

This version is very old and has been EOL for a long time. You should really look to upgrade.

What type of storage are you using? This is often the limiting factor and a common source of problems in Elasticsearch clusters under high load. What does disk I/O and await look like on the nodes?
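
For example, something like this on each physical host shows per-device utilisation and await; it assumes the sysstat package is installed:

# extended per-device statistics, refreshed every 5 seconds;
# watch %util and the await (or r_await/w_await) columns
iostat -x 5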

How are you separating nodes on the hosts? Containers?

No, each instance just starts on a different port and a different disk, nothing else.
Like this:

...
node.name: node87-1
node.master: false
node.data: true
node.attr.rack_id: node87
cluster.routing.allocation.same_shard.host: true
path.data: /mnt/data01/node87-1/data/
path.logs: /mnt/data01/node87-1/logs/
bootstrap.memory_lock: true
http.port: 9201
transport.tcp.port: 9301

OK. That means all the nodes share the same operating system page cache and can interfere with each other.
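
That also helps explain the ram.percent of 98-99 across the board in the _cat/nodes output: it effectively counts the shared page cache as used memory. On the hosts, something like this separates reclaimable cache from memory that is actually in use (assuming a reasonably recent procps):

# "buff/cache" is largely the shared page cache; "available" is what can still be handed out
free -h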

What type of storage are you using? What does disk I/O and await look like?

How many independent IO data paths are there, per server?

You have 6 data node instances per server, right? So if the answer to the previous question is less than 6, then some IO paths are servicing multiple ES instances.
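
One way to check, sketched against the path layout in the config above (adjust the globs if the mount points differ per host, and note --output is GNU df):

# map each instance's data path to the filesystem that backs it;
# a device appearing more than once means instances share an IO path
for d in /mnt/data*/node*/data; do
  printf '%s -> ' "$d"; df --output=source "$d" | tail -1
done | sort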

How are your shards arranged, per β€œactive” index?
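
For one of the busy indices, something like this shows how its shards are spread across instances and hosts (my-index is a placeholder):

curl -s 'localhost:9200/_cat/shards/my-index?v&h=index,shard,prirep,store,node&s=node'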

I’m presuming it all worked great for a while? Is that right? Did anything else change recently? Like an increase in data volume, search volume, etc.?

Out of curiosity, were there better options? Given what they apparently wanted, 6 data nodes per server, plus a few masters?

What type of storage are you using?

We use SSD.

What does disk I/O and await look like?

[Disk I/O graphs covering 2025-04-12T15:30:00 to 2025-04-12T16:30:00]


Is it a coincidence that the 2 logs shared are both from around 2-3 minutes past the hour?

11:02:48
16:02:48

Any cron jobs or similar running that might have an impact?
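
For example, on each host something like this lists scheduled jobs that could line up with those timestamps; it assumes standard cron and systemd locations:

# system-wide and per-user cron entries (run as root to see all users)
cat /etc/crontab
ls /etc/cron.d /etc/cron.hourly
for u in $(cut -d: -f1 /etc/passwd); do crontab -l -u "$u" 2>/dev/null; done

# systemd timers, if in use
systemctl list-timers --all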

We have 6 independent IO data paths per server.

Most of them are arranged by size, and almost all shards are under 50 GB.

Yeah, most of the time it works quickly and fine. But sometimes a node fails its health check, and has even left the cluster and automatically re-joined after a while.

Nothing special.

Sorry, I need to check this.

The main thing is to check for correlations, including in time. It can be other things too: snapshots, backups, scans, hosts, …
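
On the Elasticsearch side, a couple of quick checks for snapshot activity around those times (endpoints as they exist in 7.x; localhost:9200 again stands in for any node's HTTP port):

# registered snapshot repositories
curl -s 'localhost:9200/_cat/repositories?v'

# snapshots currently running, if any
curl -s 'localhost:9200/_snapshot/_status?pretty'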

Based on what you shared, for now I'd think it's as likely to be something outside Elasticsearch as within it.

Drive-by help... your master nodes are marked as ingest, machine learning, etc. In a cluster like yours they probably should not be; they should probably be master-only.

FWIW I've seen cases where TRACE logging itself can cause instability in a production cluster. The reference manual recommends not doing this:

Do not enable logging at levels DEBUG or TRACE unless you are following instructions elsewhere in this manual which call for more detailed logging, or you are an expert user who will be reading the Elasticsearch source code to determine the meaning of the logs.
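
If the tracer was switched on dynamically, it can be switched back off by resetting the logger via the cluster settings API, along these lines (assuming that is how it was enabled; adjust if you changed log4j2.properties instead):

curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d'
{
  "transient":  { "logger.org.elasticsearch.transport.TransportService.tracer": null },
  "persistent": { "logger.org.elasticsearch.transport.TransportService.tracer": null }
}'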

But yeah also 7.7 is ancient and definitely we've fixed bugs with the symptoms you describe in the 5 years since it was released. You need to upgrade to a supported version, ideally the newest available, as soon as possible.

@lwm2016

Did you enable the trace logs before you had the stability issues?

Also, the 2 log sections you posted originally, as well as being almost exactly 5 hours apart (within about half a second!), mention exactly the same 2 nodes: node87, one of your masters, and node218-5, one of your many data nodes. With so many data nodes, and given my general distrust of coincidence ....

Thank you for your suggestion, we will optimize it.

We enabled the trace logs after that.

That's great! Is there any related issue link? Sorry, I can't find it. Or does version 7.17 fix this issue?

There have been many changes in those 5 years that could have addressed this, and no, I'm not going to dig them out. It's possible 7.17 will be better, but even that series is rather old now; you need to upgrade to 8.x ASAP.

From what you have written, I understand you had no issues with the same 7.7.1 version / same config / same hardware you are using now, likely for some time. Then you started to see some issues, you enabled the trace logging, you saw the logs you posted, and you asked the forum for help.

Updating to either the latest 7.x or to 8.x is a decision you need to make. You know your environment, the rationale behind the choices you have made, and "how stuff works", in the widest sense, within your organization / environment.

Outside what I've pointed out around around the curiosities in the logs you shared, and if you share more logs and more details of your environment we might get better insights, my own hunch (just a hunch) is your issue is "outside" elasticsearch. Some subtle hardware issue, something else running that disturbs things, whatever. But I can be wrong obviously. I'd be curious if you found anything that correlates to the timing and node-affinity of the 2 errors you shared.

Similarly, David suggests you could update to a newer release, which DOES contain a lot of fixes you don't have. Would it fix your current issue? Neither he nor I can know for sure. What is almost certain is that if you do nothing, your problems will either continue as they are now, or maybe get worse.

BTW, I also notice in the output above that none of the data node instances node136-1/2/3/4/5/6 are anywhere near as loaded as the rest of the data nodes, whereas all the others (node86-x / node87-x / node88-x / node89-x / node90-x / node218-x / node227-x / node228-x) are significantly more loaded. I have absolutely no idea why that is, or what's different about node136, but maybe you do? Something with how the most heavily loaded shards are distributed? Just another curiosity.
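
If you want to chase that down, something like this shows shard counts and disk used per instance, which would confirm whether the node136 instances simply hold fewer or lighter shards:

curl -s 'localhost:9200/_cat/allocation?v&s=shards:desc'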

In the last 5 years we've fixed loads of bugs in ES that would explain these symptoms in sufficiently large clusters. It's not impossible that the problem is elsewhere, of course, but the chances are it's one of these bugs. At least, I wouldn't start looking elsewhere until ruling them out (by upgrading). Recent versions have way better tooling for investigating this kind of issue too, so even if the upgrade doesn't fix it, it'll give you what you need to open a bug report.

David knows ES wayyyy better than I do, and the details of all the fixed bugs too. But a 57-instance production cluster upgrade is not a minor effort. I still think it is worthwhile to explore other avenues, in parallel with preparing the upgrade (which is a good idea by itself, noting that 9.x will be with us soon and @lwm2016 would therefore be 2 major releases behind).

A priori I would agree with David, but the extra information that's been shared shows error logs exactly 5 hours apart (within half a second) on the same 2 instances (OK, node87 is a master, but node218-5 is 1 of 54 data instances); this is fairly suggestive IMO. But indeed it is only 2 log footprints.

Let's see if @lwm2016 can figure it out (I hope so), or shares some more findings / diagnostics.