Received response for a request that has timed out and "failed to retrieve stats for node"

Hi,

I have a 5-node cluster that receives around 400 GB of data per day. Disk usage is at the 70% mark on all the nodes.
We have observed slowness even when running simple requests like _cat/indices. When we checked the logs, we found the warning messages below being logged continuously.

[2023-09-27T09:09:13,854][WARN ][o.e.c.InternalClusterInfoService] [master-node] failed to retrieve stats for node [PYlSvaErTTWGns-QCk4H9A]: [data-node-4][x.x.x.x:9300][cluster:monitor/nodes/stats[n]] request_id [340711879] timed out after [15005ms]
[2023-09-27T09:09:13,856][WARN ][o.e.c.InternalClusterInfoService] [master-node] failed to retrieve shard stats from node [PYlSvaErTTWGns-QCk4H9A]: [data-node-4][x.x.x.x:9300][indices:monitor/stats[n]] request_id [340711891] timed out after [15005ms]
[2023-09-27T09:09:24,962][WARN ][o.e.t.TransportService   ] [master-node] Received response for a request that has timed out, sent [26.2s/26209ms] ago, timed out [11.2s/11204ms] ago, action [cluster:monitor/nodes/stats[n]], node [{data-node-4}{PYlSvaErTTWGns-QCk4H9A}{KR0hKPZjRO26lyYBilY3oQ}{x.x.x.x}{x.x.x.x:9300}{dimv}{xpack.installed=true, transform.node=false}], id [340711879]
[2023-09-27T09:09:25,013][WARN ][o.e.t.TransportService   ] [master-node] Received response for a request that has timed out, sent [26.2s/26209ms] ago, timed out [11.2s/11204ms] ago, action [indices:monitor/stats[n]], node [{data-node-4}{PYlSvaErTTWGns-QCk4H9A}{KR0hKPZjRO26lyYBilY3oQ}{x.x.x.x}{x.x.x.x:9300}{dimv}{xpack.installed=true, transform.node=false}], id [340711891]

When there is no indexing load, the cluster behaves well.
We also have an index with 400 GB of data spread across 20 shards (roughly 20 GB per shard). Could that be causing an issue?
We are not sure what exactly is happening in the cluster.
Can someone help us resolve this issue?
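
For reference, per-shard sizes for that index can be checked with the _cat/shards API. A minimal sketch, assuming the cluster is reachable on localhost:9200 without authentication and using "my-400gb-index" as a placeholder for the real index name:

# list the shards of the index with their on-disk size, largest first
curl -s "localhost:9200/_cat/shards/my-400gb-index?v&h=index,shard,prirep,store&s=store:desc"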

Hi Sanjay, what version are you using? Also, I believe you might have a support contract; have you opened a support case about this?

Thanks @DavidTurner for your response.

We are using Elasticsearch version 7.16.3. At present we do not have a support contract, so we handle cluster maintenance ourselves.

What is the full output of the cluster stats API?

What type of storage are you using? Local SSDs?
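
For reference, the full cluster stats output can be fetched with a single request. A minimal sketch, assuming the default HTTP port and no authentication:

# human-readable, pretty-printed cluster-wide statistics
curl -s "localhost:9200/_cluster/stats?human&pretty"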

@Christian_Dahlqvist,

We are using AWS Graviton-based servers with HDDs.

Please find the cluster stats below:

{
	"_nodes": {
		"total": 9,
		"successful": 9,
		"failed": 0
	},
	"cluster_name": "logsearch",
	"cluster_uuid": "cUEYj2HwbggGh31RgCFTO5",
	"timestamp": 1695886773943,
	"status": "green",
	"indices": {
		"count": 1334,
		"shards": {
			"total": 2168,
			"primaries": 2117,
			"replication": 0.024090694378837979,
			"index": {
				"shards": {
					"min": 1,
					"max": 20,
					"avg": 1.6251874062968517
				},
				"primaries": {
					"min": 1,
					"max": 20,
					"avg": 1.5869565217391304
				},
				"replication": {
					"min": 0.0,
					"max": 1.0,
					"avg": 0.03823088455772114
				}
			}
		},
		"docs": {
			"count": 14138344965,
			"deleted": 171333523
		},
		"store": {
			"size_in_bytes": 11529600888415,
			"total_data_set_size_in_bytes": 11529600888415,
			"reserved_in_bytes": 0
		},
		"fielddata": {
			"memory_size_in_bytes": 4496812096,
			"evictions": 0
		},
		"query_cache": {
			"memory_size_in_bytes": 624981767,
			"total_count": 9058293774,
			"hit_count": 526847372,
			"miss_count": 8531446402,
			"cache_size": 632479,
			"cache_count": 27701572,
			"evictions": 27069093
		},
		"completion": {
			"size_in_bytes": 0
		},
		"segments": {
			"count": 38949,
			"memory_in_bytes": 1811063204,
			"terms_memory_in_bytes": 1594972056,
			"stored_fields_memory_in_bytes": 62419056,
			"term_vectors_memory_in_bytes": 0,
			"norms_memory_in_bytes": 1058560,
			"points_memory_in_bytes": 0,
			"doc_values_memory_in_bytes": 152613532,
			"index_writer_memory_in_bytes": 322627488,
			"version_map_memory_in_bytes": 27351944,
			"fixed_bit_set_memory_in_bytes": 2264,
			"max_unsafe_auto_id_timestamp": 1695886687017,
			"file_sizes": {}
		},
		"mappings": {
			"field_types": [
				{
					"name": "boolean",
					"count": 576,
					"index_count": 206,
					"script_count": 0
				},
				{
					"name": "constant_keyword",
					"count": 6,
					"index_count": 2,
					"script_count": 0
				},
				{
					"name": "date",
					"count": 3369,
					"index_count": 1324,
					"script_count": 0
				},
				{
					"name": "float",
					"count": 1874,
					"index_count": 411,
					"script_count": 0
				},
				{
					"name": "geo_point",
					"count": 1232,
					"index_count": 1232,
					"script_count": 0
				},
				{
					"name": "half_float",
					"count": 2464,
					"index_count": 1232,
					"script_count": 0
				},
				{
					"name": "integer",
					"count": 688,
					"index_count": 344,
					"script_count": 0
				},
				{
					"name": "ip",
					"count": 4782,
					"index_count": 1234,
					"script_count": 0
				},
				{
					"name": "keyword",
					"count": 75059,
					"index_count": 1328,
					"script_count": 0
				},
				{
					"name": "long",
					"count": 5640,
					"index_count": 1315,
					"script_count": 0
				},
				{
					"name": "nested",
					"count": 1,
					"index_count": 1,
					"script_count": 0
				},
				{
					"name": "object",
					"count": 4821,
					"index_count": 1260,
					"script_count": 0
				},
				{
					"name": "text",
					"count": 74987,
					"index_count": 1328,
					"script_count": 0
				},
				{
					"name": "version",
					"count": 1,
					"index_count": 1,
					"script_count": 0
				}
			],
			"runtime_field_types": []
		},
		"analysis": {
			"char_filter_types": [],
			"tokenizer_types": [],
			"filter_types": [],
			"analyzer_types": [],
			"built_in_char_filters": [],
			"built_in_tokenizers": [],
			"built_in_filters": [],
			"built_in_analyzers": []
		},
		"versions": [
			{
				"version": "7.16.2",
				"index_count": 359,
				"primary_shard_count": 563,
				"total_primary_bytes": 2655566480472
			},
			{
				"version": "7.16.3",
				"index_count": 975,
				"primary_shard_count": 1554,
				"total_primary_bytes": 8873319103975
			}
		]
	},
	"nodes": {
		"count": {
			"total": 9,
			"coordinating_only": 3,
			"data": 5,
			"data_cold": 0,
			"data_content": 0,
			"data_frozen": 0,
			"data_hot": 0,
			"data_warm": 0,
			"ingest": 5,
			"master": 3,
			"ml": 0,
			"remote_cluster_client": 0,
			"transform": 0,
			"voting_only": 0
		},
		"versions": [
			"7.16.3"
		],
		"os": {
			"available_processors": 56,
			"allocated_processors": 56,
			"names": [
				{
					"name": "Linux",
					"count": 9
				}
			],
			"pretty_names": [
				{
					"pretty_name": "Ubuntu 20.04.6 LTS",
					"count": 9
				}
			],
			"architectures": [
				{
					"arch": "aarch64",
					"count": 9
				}
			],
			"mem": {
				"total_in_bytes": 465951612928,
				"free_in_bytes": 48858521600,
				"used_in_bytes": 417093091328,
				"free_percent": 10,
				"used_percent": 90
			}
		},
		"process": {
			"cpu": {
				"percent": 102
			},
			"open_file_descriptors": {
				"min": 511,
				"max": 5939,
				"avg": 3563
			}
		},
		"jvm": {
			"max_uptime_in_millis": 1461785332,
			"versions": [
				{
					"version": "17.0.1",
					"vm_name": "OpenJDK 64-Bit Server VM",
					"vm_version": "17.0.1+12",
					"vm_vendor": "Eclipse Adoptium",
					"bundled_jdk": true,
					"using_bundled_jdk": true,
					"count": 9
				}
			],
			"mem": {
				"heap_used_in_bytes": 136030844920,
				"heap_max_in_bytes": 240518168576
			},
			"threads": 697
		},
		"fs": {
			"total_in_bytes": 16608740769792,
			"free_in_bytes": 4912524836864,
			"available_in_bytes": 4912457728000
		},
		"plugins": [
			{
				"name": "repository-s3",
				"version": "7.16.3",
				"elasticsearch_version": "7.16.3",
				"java_version": "1.8",
				"description": "The S3 repository plugin adds S3 repositories",
				"classname": "org.elasticsearch.repositories.s3.S3RepositoryPlugin",
				"extended_plugins": [],
				"has_native_controller": false,
				"licensed": false,
				"type": "isolated"
			}
		],
		"network_types": {
			"transport_types": {
				"security4": 9
			},
			"http_types": {
				"security4": 9
			}
		},
		"discovery_types": {
			"zen": 9
		},
		"packaging_types": [
			{
				"flavor": "default",
				"type": "deb",
				"count": 9
			}
		],
		"ingest": {
			"number_of_pipelines": 2,
			"processor_stats": {
				"gsub": {
					"count": 0,
					"failed": 0,
					"current": 0,
					"time_in_millis": 0
				},
				"script": {
					"count": 0,
					"failed": 0,
					"current": 0,
					"time_in_millis": 0
				}
			}
		}
	}
}

Indexing in Elasticsearch is generally I/O intensive, so if your issues go away when you stop indexing, it seems quite likely that your storage is too slow for the load you are putting on it. I would look at disk stats, e.g. using iostat -x, during periods when you are indexing and experiencing issues to see whether your storage is indeed the bottleneck.
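
A minimal sketch of such a capture, assuming the sysstat package is installed and that it is run on a data node during an indexing-heavy period:

# extended per-device stats, sampled every 5 seconds for one minute;
# watch %util, r_await and w_await on the Elasticsearch data disk
iostat -x 5 12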

Understood, @Christian_Dahlqvist.

We have captured iostat -x output while facing the issue.
Please find the iostat -x output from the master node below.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.34    0.01    0.13    0.01    0.00   99.51

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz  aqu-sz  %util
loop0            0.00      0.05     0.00   0.00    0.18    28.75    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop1            0.00      0.00     0.00   0.00    1.08    22.33    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop2            0.00      0.00     0.00   0.00    0.67    29.44    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop3            0.00      0.00     0.00   0.00    0.55    24.09    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop4            0.00      0.00     0.00   0.00    1.31    17.56    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop5            0.00      0.01     0.00   0.00    0.31     8.60    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop6            0.00      0.00     0.00   0.00    0.90    10.20    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop7            0.00      0.00     0.00   0.00    1.00    29.33    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop8            0.00      0.18     0.00   0.00    0.24    38.67    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop9            0.00      0.03     0.00   0.00    0.03    27.38    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
nvme0n1          0.23      9.95     0.02   8.79    0.75    44.21    2.28     21.19     1.47  39.17    0.85     9.30    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.17


And from one of the data nodes:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          15.63    0.00    0.91   39.78    0.00   43.68

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz  aqu-sz  %util
loop0            0.01      0.33     0.00   0.00    0.24    36.44    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop1            0.02      0.45     0.00   0.00    0.18    29.08    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop10           0.00      0.00     0.00   0.00    0.00     1.27    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop2            0.00      0.00     0.00   0.00    0.20    11.04    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop3            0.00      0.10     0.00   0.00    0.07    39.07    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop4            0.00      0.00     0.00   0.00    0.87    18.23    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop5            0.00      0.00     0.00   0.00    0.21    20.04    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop6            0.00      0.00     0.00   0.00    0.42    13.67    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop7            0.00      0.00     0.00   0.00    0.31    19.48    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop8            0.00      0.00     0.00   0.00    0.24     9.24    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
loop9            0.00      0.06     0.00   0.00    0.14    30.20    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.00
nvme0n1          0.91     33.47     0.13  12.29    0.63    36.80    1.98     34.00     1.00  33.51    1.49    17.19    0.00      0.00     0.00   0.00    0.00     0.00    0.00   0.26
nvme1n1          9.81    879.10     0.00   0.00   31.83    89.62  146.27   9975.63     5.51   3.63   15.37    68.20    0.00      0.00     0.00   0.00    0.00     0.00    0.31  88.89


Can you please have a look and suggest what needs to be done?

It looks like the data node is under quite heavy load with respect to disk I/O; r_await and utilisation are quite high. I suspect this may be what is causing the operations to time out when the cluster is indexing. I would recommend upgrading to faster storage, or perhaps reducing the frequency with which you gather stats. Another option may be to throttle indexing to reduce load.
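
As an illustration of reducing the stats-gathering pressure, the interval and timeout used by the internal cluster info service (the source of the timed-out stats requests in the logs) are dynamic cluster settings. A minimal sketch of relaxing them, where the values 60s and 30s are arbitrary examples rather than recommendations:

# raise the collection interval and the per-collection timeout
curl -s -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.info.update.interval": "60s",
    "cluster.info.update.timeout": "30s"
  }
}'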
