Received response for a request that has timed out and "failed to retrieve stats for node"

Hi,

I have a 5-node cluster that receives around 400 GB of data per day. Disk usage is at the 70% mark on all the nodes.
We have observed slowness even when running simple requests like _cat/indices. When we checked the logs, we found the warning messages below being logged continuously.

[2023-09-27T09:09:13,854][WARN ][o.e.c.InternalClusterInfoService] [master-node] failed to retrieve stats for node [PYlSvaErTTWGns-QCk4H9A]: [data-node-4][x.x.x.x:9300][cluster:monitor/nodes/stats[n]] request_id [340711879] timed out after [15005ms]
[2023-09-27T09:09:13,856][WARN ][o.e.c.InternalClusterInfoService] [master-node] failed to retrieve shard stats from node [PYlSvaErTTWGns-QCk4H9A]: [data-node-4][x.x.x.x:9300][indices:monitor/stats[n]] request_id [340711891] timed out after [15005ms]
[2023-09-27T09:09:24,962][WARN ][o.e.t.TransportService   ] [master-node] Received response for a request that has timed out, sent [26.2s/26209ms] ago, timed out [11.2s/11204ms] ago, action [cluster:monitor/nodes/stats[n]], node [{data-node-4}{PYlSvaErTTWGns-QCk4H9A}{KR0hKPZjRO26lyYBilY3oQ}{x.x.x.x}{x.x.x.x:9300}{dimv}{xpack.installed=true, transform.node=false}], id [340711879]
[2023-09-27T09:09:25,013][WARN ][o.e.t.TransportService   ] [master-node] Received response for a request that has timed out, sent [26.2s/26209ms] ago, timed out [11.2s/11204ms] ago, action [indices:monitor/stats[n]], node [{data-node-4}{PYlSvaErTTWGns-QCk4H9A}{KR0hKPZjRO26lyYBilY3oQ}{x.x.x.x}{x.x.x.x:9300}{dimv}{xpack.installed=true, transform.node=false}], id [340711891]

When there is no indexing load, the cluster behaves well.
We also have an index with 400 GB of data spread across 20 shards (roughly 20 GB per shard). Could that be causing an issue?
We are not sure what exactly is happening in the cluster.
Can someone help us resolve this issue?
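
For reference, per-shard sizes for that index can be checked with the _cat/shards API. A minimal sketch, assuming the cluster is reachable on localhost:9200 without authentication and using "my-400gb-index" as a placeholder for the real index name:

# list the shards of the index with their on-disk size, largest first
curl -s "localhost:9200/_cat/shards/my-400gb-index?v&h=index,shard,prirep,store&s=store:desc"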

Hi Sanjay, what version are you using? Also, I believe you might have a support contract; have you opened a support case about this?

Thanks @DavidTurner for your response.

We are using Elasticsearch version 7.16.3. At present we do not have a support contract, so we handle cluster maintenance ourselves.

What is the full output of the cluster stats API?

What type of storage are you using? Local SSDs?
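
For reference, the full cluster stats output can be fetched with a single request. A minimal sketch, assuming the default HTTP port and no authentication:

# human-readable, pretty-printed cluster-wide statistics
curl -s "localhost:9200/_cluster/stats?human&pretty"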

@Christian_Dahlqvist,

We are using AWS Graviton-based servers with HDDs.

Please find the cluster stats below:

{
	"_nodes": {
		"total": 9,
		"successful": 9,
		"failed": 0
	},
	"cluster_name": "logsearch",
	"cluster_uuid": "cUEYj2HwbggGh31RgCFTO5",
	"timestamp": 1695886773943,
	"status": "green",
	"indices": {
		"count": 1334,
		"shards": {
			"total": 2168,
			"primaries": 2117,
			"replication": 0.024090694378837979,
			"index": {
				"shards": {
					"min": 1,
					"max": 20,
					"avg": 1.6251874062968517
				},
				"primaries": {
					"min": 1,
					"max": 20,
					"avg": 1.5869565217391304
				},
				"replication": {
					"min": 0.0,
					"max": 1.0,
					"avg": 0.03823088455772114
				}
			}
		},
		"docs": {
			"count": 14138344965,
			"deleted": 171333523
		},
		"store": {
			"size_in_bytes": 11529600888415,
			"total_data_set_size_in_bytes": 11529600888415,
			"reserved_in_bytes": 0
		},
		"fielddata": {
			"memory_size_in_bytes": 4496812096,
			"evictions": 0
		},
		"query_cache": {
			"memory_size_in_bytes": 624981767,
			"total_count": 9058293774,
			"hit_count": 526847372,
			"miss_count": 8531446402,
			"cache_size": 632479,
			"cache_count": 27701572,
			"evictions": 27069093
		},
		"completion": {
			"size_in_bytes": 0
		},
		"segments": {
			"count": 38949,
			"memory_in_bytes": 1811063204,
			"terms_memory_in_bytes": 1594972056,
			"stored_fields_memory_in_bytes": 62419056,
			"term_vectors_memory_in_bytes": 0,
			"norms_memory_in_bytes": 1058560,
			"points_memory_in_bytes": 0,
			"doc_values_memory_in_bytes": 152613532,
			"index_writer_memory_in_bytes": 322627488,
			"version_map_memory_in_bytes": 27351944,
			"fixed_bit_set_memory_in_bytes": 2264,
			"max_unsafe_auto_id_timestamp": 1695886687017,
			"file_sizes": {}
		},
		"mappings": {
			"field_types": [
				{
					"name": "boolean",
					"count": 576,
					"index_count": 206,
					"script_count": 0
				},
				{
					"name": "constant_keyword",
					"count": 6,
					"index_count": 2,
					"script_count": 0
				},
				{
					"name": "date",
					"count": 3369,
					"index_count": 1324,
					"script_count": 0
				},
				{
					"name": "float",
					"count": 1874,
					"index_count": 411,
					"script_count": 0
				},
				{
					"name": "geo_point",
					"count": 1232,
					"index_count": 1232,
					"script_count": 0
				},
				{
					"name": "half_float",
					"count": 2464,
					"index_count": 1232,
					"script_count": 0
				},
				{
					"name": "integer",
					"count": 688,
					"index_count": 344,
					"script_count": 0
				},
				{
					"name": "ip",
					"count": 4782,
					"index_count": 1234,
					"script_count": 0
				},
				{
					"name": "keyword",
					"count": 75059,
					"index_count": 1328,
					"script_count": 0
				},
				{
					"name": "long",
					"count": 5640,
					"index_count": 1315,
					"script_count": 0
				},
				{
					"name": "nested",
					"count": 1,
					"index_count": 1,
					"script_count": 0
				},
				{
					"name": "object",
					"count": 4821,
					"index_count": 1260,
					"script_count": 0
				},
				{
					"name": "text",
					"count": 74987,
					"index_count": 1328,
					"script_count": 0
				},
				{
					"name": "version",
					"count": 1,
					"index_count": 1,
					"script_count": 0
				}
			],
			"runtime_field_types": []
		},
		"analysis": {
			"char_filter_types": [],
			"tokenizer_types": [],
			"filter_types": [],
			"analyzer_types": [],
			"built_in_char_filters": [],
			"built_in_tokenizers": [],
			"built_in_filters": [],
			"built_in_analyzers": []
		},
		"versions": [
			{
				"version": "7.16.2",
				"index_count": 359,
				"primary_shard_count": 563,
				"total_primary_bytes": 2655566480472
			},
			{
				"version": "7.16.3",
				"index_count": 975,
				"primary_shard_count": 1554,
				"total_primary_bytes": 8873319103975
			}
		]
	},
	"nodes": {
		"count": {
			"total": 9,
			"coordinating_only": 3,
			"data": 5,
			"data_cold": 0,
			"data_content": 0,
			"data_frozen": 0,
			"data_hot": 0,
			"data_warm": 0,
			"ingest": 5,
			"master": 3,
			"ml": 0,
			"remote_cluster_client": 0,
			"transform": 0,
			"voting_only": 0
		},
		"versions": [
			"7.16.3"
		],
		"os": {
			"available_processors": 56,
			"allocated_processors": 56,
			"names": [
				{
					"name": "Linux",
					"count": 9
				}
			],
			"pretty_names": [
				{
					"pretty_name": "Ubuntu 20.04.6 LTS",
					"count": 9
				}
			],
			"architectures": [
				{
					"arch": "aarch64",
					"count": 9
				}
			],
			"mem": {
				"total_in_bytes": 465951612928,
				"free_in_bytes": 48858521600,
				"used_in_bytes": 417093091328,
				"free_percent": 10,
				"used_percent": 90
			}
		},
		"process": {
			"cpu": {
				"percent": 102
			},
			"open_file_descriptors": {
				"min": 511,
				"max": 5939,
				"avg": 3563
			}
		},
		"jvm": {
			"max_uptime_in_millis": 1461785332,
			"versions": [
				{
					"version": "17.0.1",
					"vm_name": "OpenJDK 64-Bit Server VM",
					"vm_version": "17.0.1+12",
					"vm_vendor": "Eclipse Adoptium",
					"bundled_jdk": true,
					"using_bundled_jdk": true,
					"count": 9
				}
			],
			"mem": {
				"heap_used_in_bytes": 136030844920,
				"heap_max_in_bytes": 240518168576
			},
			"threads": 697
		},
		"fs": {
			"total_in_bytes": 16608740769792,
			"free_in_bytes": 4912524836864,
			"available_in_bytes": 4912457728000
		},
		"plugins": [
			{
				"name": "repository-s3",
				"version": "7.16.3",
				"elasticsearch_version": "7.16.3",
				"java_version": "1.8",
				"description": "The S3 repository plugin adds S3 repositories",
				"classname": "org.elasticsearch.repositories.s3.S3RepositoryPlugin",
				"extended_plugins": [],
				"has_native_controller": false,
				"licensed": false,
				"type": "isolated"
			}
		],
		"network_types": {
			"transport_types": {
				"security4": 9
			},
			"http_types": {
				"security4": 9
			}
		},
		"discovery_types": {
			"zen": 9
		},
		"packaging_types": [
			{
				"flavor": "default",
				"type": "deb",
				"count": 9
			}
		],
		"ingest": {
			"number_of_pipelines": 2,
			"processor_stats": {
				"gsub": {
					"count": 0,
					"failed": 0,
					"current": 0,
					"time_in_millis": 0
				},
				"script": {
					"count": 0,
					"failed": 0,
					"current": 0,
					"time_in_millis": 0
				}
			}
		}
	}
}

Indexing in Elasticsearch is generally I/O intensive, so if your issues go away when you stop indexing, it seems quite likely that your storage is too slow for the load you are putting on it. I would look at disk stats, e.g. using iostat -x, during periods when you are indexing and experiencing issues to see whether your storage is indeed the bottleneck.
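
A minimal sketch of such a capture, assuming the sysstat package is installed and that it is run on a data node during an indexing-heavy period:

# extended per-device stats, sampled every 5 seconds for one minute;
# watch %util, r_await and w_await on the Elasticsearch data disk
iostat -x 5 12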

Understood, @Christian_Dahlqvist.

We have captured iostat -x output while facing the issue.
Please find the iostat -x output from the master node below.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.34    0.01    0.13    0.01    0.00   99.51

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz  aqu-sz  %util
loop0            0.00      0.05     0.00   0.00    0.18    28.75    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop1            0.00      0.00     0.00   0.00    1.08    22.33    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop2            0.00      0.00     0.00   0.00    0.67    29.44    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop3            0.00      0.00     0.00   0.00    0.55    24.09    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop4            0.00      0.00     0.00   0.00    1.31    17.56    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop5            0.00      0.01     0.00   0.00    0.31     8.60    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop6            0.00      0.00     0.00   0.00    0.90    10.20    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop7            0.00      0.00     0.00   0.00    1.00    29.33    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop8            0.00      0.18     0.00   0.00    0.24    38.67    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop9            0.00      0.03     0.00   0.00    0.03    27.38    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
nvme0n1          0.23      9.95     0.02   8.79    0.75    44.21    2.28     21.19     1.47  39.17    0.85     9.30    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.17


And from one of the data nodes:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          15.63    0.00    0.91   39.78    0.00   43.68

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz  aqu-sz  %util
loop0            0.01      0.33     0.00   0.00    0.24    36.44    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop1            0.02      0.45     0.00   0.00    0.18    29.08    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop10           0.00      0.00     0.00   0.00    0.00     1.27    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop2            0.00      0.00     0.00   0.00    0.20    11.04    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop3            0.00      0.10     0.00   0.00    0.07    39.07    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop4            0.00      0.00     0.00   0.00    0.87    18.23    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop5            0.00      0.00     0.00   0.00    0.21    20.04    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop6            0.00      0.00     0.00   0.00    0.42    13.67    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop7            0.00      0.00     0.00   0.00    0.31    19.48    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop8            0.00      0.00     0.00   0.00    0.24     9.24    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop9            0.00      0.06     0.00   0.00    0.14    30.20    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
nvme0n1          0.91     33.47     0.13  12.29    0.63    36.80    1.98     34.00     1.00  33.51    1.49    17.19    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.26
nvme1n1          9.81    879.10     0.00   0.00   31.83    89.62  146.27   9975.63     5.51   3.63   15.37    68.20    0.00      0.00     0.00   0.00    0.00     0.00    0.31  88.89


Can you please have a look and suggest what needs to be done?

It looks like the data node is under quite heavy load with respect to disk I/O; r_await and utilisation are quite high. I suspect this may be what is causing the operations to time out when the cluster is indexing. I would recommend upgrading to faster storage, or perhaps reducing the frequency with which you gather stats. Another option may be to throttle indexing to reduce load.
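
As an illustration of reducing the stats-gathering pressure, the interval and timeout used by the internal cluster info service (the source of the timed-out stats requests in the logs) are dynamic cluster settings. A minimal sketch of relaxing them, where the values 60s and 30s are arbitrary examples rather than recommendations:

# raise the collection interval and the per-collection timeout
curl -s -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.info.update.interval": "60s",
    "cluster.info.update.timeout": "30s"
  }
}'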
