Hi Team,
We have deployed an Elasticsearch cluster across 9 physical machines, each hosting multiple instances (six data nodes per machine, plus a dedicated master-eligible node on three of the machines). The hardware specifications per machine are:
- 112 CPU cores
- 503 GB memory
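To cross-check how these machine-level resources map onto the individual instances, the per-node heap and RAM figures can be pulled from the _cat/nodes API. A minimal example, assuming direct HTTP access to any node (the host and port below are placeholders):
curl -s 'http://node86:9200/_cat/nodes?v&h=name,node.role,heap.max,heap.percent,ram.max,ram.percent'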
Here is the cluster and node info:
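For reference, the two outputs below should be reproducible with the standard root-info and cluster-health endpoints (the host is a placeholder):
curl -s 'http://node86:9200/?pretty'
curl -s 'http://node86:9200/_cluster/health?pretty'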
{
  "name" : "node86",
  "cluster_name" : "es-prod",
  "cluster_uuid" : "2L75uEv7RjCEvNwViB5L_w",
  "version" : {
    "number" : "7.7.1",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "ad56dce891c901a492bb1ee393f12dfff473a423",
    "build_date" : "2020-05-28T16:30:01.040088Z",
    "build_snapshot" : false,
    "lucene_version" : "8.5.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
{
  "cluster_name" : "es-prod",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 57,
  "number_of_data_nodes" : 54,
  "active_primary_shards" : 3042,
  "active_shards" : 6164,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
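For scale, that is 6,164 active shards spread across 54 data nodes, i.e. roughly 114 shards per data node on average.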
name heap.percent ram.percent cpu load_1m load_5m load_15m node.role master
node89-1 52 99 17 23.68 23.44 23.60 dilrt -
node86-6 59 99 17 23.46 22.35 23.93 dilrt -
node88-5 33 99 18 21.38 22.55 22.36 dilrt -
node136-2 34 99 5 6.33 5.76 5.93 dilrt -
node136-3 16 99 5 6.33 5.76 5.93 dilrt -
node88-1 64 99 18 21.38 22.55 22.36 dilrt -
node87 30 99 15 22.16 25.03 24.95 ilmr *
node87-6 36 99 15 22.16 25.03 24.95 dilrt -
node228-3 38 98 16 25.54 21.54 21.95 dilrt -
node89-3 58 99 17 23.68 23.44 23.60 dilrt -
node90-4 70 99 14 20.88 19.48 20.62 dilrt -
node90-5 41 99 14 20.88 19.48 20.62 dilrt -
node87-1 54 99 15 22.16 25.03 24.95 dilrt -
node218-4 45 98 17 27.75 23.19 24.56 dilrt -
node228-6 41 98 17 25.54 21.54 21.95 dilrt -
node218-2 18 98 19 27.75 23.19 24.56 dilrt -
node90-3 63 99 14 20.88 19.48 20.62 dilrt -
node136-6 64 99 5 6.33 5.76 5.93 dilrt -
node228-1 37 98 17 25.54 21.54 21.95 dilrt -
node227-2 43 99 22 24.07 23.45 23.81 dilrt -
node227-3 26 99 21 24.07 23.45 23.81 dilrt -
node90-6 50 99 14 20.88 19.48 20.62 dilrt -
node88-6 57 99 18 21.38 22.55 22.36 dilrt -
node136-1 42 99 5 6.33 5.76 5.93 dilrt -
node136-4 50 99 5 6.33 5.76 5.93 dilrt -
node87-2 61 99 14 22.16 25.03 24.95 dilrt -
node86-3 69 99 17 23.46 22.35 23.93 dilrt -
node90-2 46 99 14 20.88 19.48 20.62 dilrt -
node228-5 23 98 17 25.54 21.54 21.95 dilrt -
node88-2 60 99 18 21.38 22.55 22.36 dilrt -
node87-5 55 99 15 22.16 25.03 24.95 dilrt -
node86-4 58 99 17 23.46 22.35 23.93 dilrt -
node86-1 37 99 17 23.46 22.35 23.93 dilrt -
node218-5 63 98 19 27.75 23.19 24.56 dilrt -
node90-1 64 99 14 20.88 19.48 20.62 dilrt -
node228-4 56 98 17 25.54 21.54 21.95 dilrt -
node218-1 63 98 19 27.75 23.19 24.56 dilrt -
node136-5 45 99 4 6.33 5.76 5.93 dilrt -
node87-4 52 99 15 22.16 25.03 24.95 dilrt -
node218-6 28 98 19 27.75 23.19 24.56 dilrt -
node228-2 53 98 17 25.54 21.54 21.95 dilrt -
node86 46 99 17 23.46 22.35 23.93 ilmr -
node89-4 66 99 16 23.68 23.44 23.60 dilrt -
node227-5 70 99 21 24.07 23.45 23.81 dilrt -
node89-2 35 99 17 23.68 23.44 23.60 dilrt -
node227-6 66 99 21 24.07 23.45 23.81 dilrt -
node88-4 49 99 14 21.38 22.55 22.36 dilrt -
node88-3 38 99 18 21.38 22.55 22.36 dilrt -
node227-4 46 99 21 24.07 23.45 23.81 dilrt -
node88 28 99 14 21.38 22.55 22.36 ilmr -
node89-5 67 99 17 23.68 23.44 23.60 dilrt -
node87-3 64 99 16 22.16 25.03 24.95 dilrt -
node86-5 62 99 17 23.46 22.35 23.93 dilrt -
node89-6 61 99 17 23.68 23.44 23.60 dilrt -
node227-1 24 99 18 24.07 23.45 23.81 dilrt -
node218-3 71 98 17 27.75 23.19 24.56 dilrt -
node86-2 32 99 18 23.46 22.35 23.93 dilrt -
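The table above is _cat/nodes output; for reference, a request along these lines should reproduce it (the host is a placeholder):
curl -s 'http://node86:9200/_cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m,load_5m,load_15m,node.role,master'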
Recently, the cluster has been experiencing intermittent, short periods of unresponsiveness, causing significant delays in API calls.
During these incidents, we observed the following trace log entries:
[2025-04-12T11:02:48,310][TRACE][o.e.t.T.tracer ] [node87] [43727295977][cluster:monitor/nodes/stats[n]] sent to [{node218-5}{2zc4xF7ITjSTTN3ODKt-LA}{CoxjuxgBSXqZFAYPM-TOow}{10.25.204.218}{10.25.204.218:9305}{dilrt}{rack_id=node218, ml.machine_memory=540497911808, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}] (timeout: [null])
[2025-04-12T11:02:48,310][TRACE][o.e.t.T.tracer ] [node218-5] [43727295977][cluster:monitor/nodes/stats[n]] received request
[2025-04-12T11:03:04,592][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node218-5] collector [node_stats] timed out when collecting data
[2025-04-12T11:03:06,191][TRACE][o.e.t.T.tracer ] [node218-5] [43727295977][cluster:monitor/nodes/stats[n]] sent response
[2025-04-12T16:02:48,872][TRACE][o.e.t.T.tracer ] [node87] [44003630793][cluster:monitor/nodes/stats[n]] sent to [{node218-5}{2zc4xF7ITjSTTN3ODKt-LA}{CoxjuxgBSXqZFAYPM-TOow}{10.25.204.218}{10.25.204.218:9305}{dilrt}{rack_id=node218, ml.machine_memory=540497911808, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}] (timeout: [15s])
[2025-04-12T16:03:05,342][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node218-5] collector [node_stats] timed out when collecting data
[2025-04-12T16:03:06,007][TRACE][o.e.t.T.tracer ] [node218-5] [44003630793][cluster:monitor/nodes/stats[n]] received request
[2025-04-12T16:03:06,133][TRACE][o.e.t.T.tracer ] [node218-5] [44003630793][cluster:monitor/nodes/stats[n]] sent response
[2025-04-12T16:03:06,133][WARN ][o.e.t.TransportService ] [node87] Received response for a request that has timed out, sent [17408ms] ago, timed out [2401ms] ago, action [cluster:monitor/nodes/stats[n]], node [{node218-5}{2zc4xF7ITjSTTN3ODKt-LA}{CoxjuxgBSXqZFAYPM-TOow}{10.25.204.218}{10.25.204.218:9305}{dilrt}{rack_id=node218, ml.machine_memory=540497911808, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}], id [44003630793]
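For context, the trace entries above come from the transport tracer (the o.e.t.T.tracer logger). A minimal sketch of how it can be toggled dynamically, in case more targeted traces would help; the host and the include pattern here are assumptions on our side:
curl -s -X PUT 'http://node86:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "logger.org.elasticsearch.transport.TransportService.tracer": "TRACE",
    "transport.tracer.include": "cluster:monitor/nodes/stats*"
  }
}'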
Could you help investigate the root cause of these intermittent failures? Let us know if additional logs or metrics are needed.
Thank you!