Hi,
we have a 9-node cluster in production that had been running for 10 days without any issue, until a couple of days ago when we changed our index settings from 2 shards & 1 replica to 3 shards & 1 replica.
Just after the daily index rotation, I noticed a series of node_stats collector timeout errors on one of the data nodes.
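For reference, this is roughly how we applied the shard change for the daily indices; the template name and index pattern below are placeholders, not our real ones:
PUT _template/daily-logs
{
  "index_patterns" : ["logs-*"],
  "settings" : {
    "number_of_shards" : 3,
    "number_of_replicas" : 1
  }
}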
Data node log:
[2020-01-28T00:10:55,636][ERROR][o.e.x.m.c.n.NodeStatsCollector] [es-data-1] collector [node_stats] timed out when collecting data
[2020-01-28T00:12:57,323][ERROR][o.e.x.m.c.n.NodeStatsCollector] [es-data-1] collector [node_stats] timed out when collecting data
[2020-01-28T03:30:37,604][ERROR][o.e.x.m.c.n.NodeStatsCollector] [es-data-1] collector [node_stats] timed out when collecting data
[2020-01-28T03:33:07,607][ERROR][o.e.x.m.c.n.NodeStatsCollector] [es-data-1] collector [node_stats] timed out when collecting data
[2020-01-28T03:33:47,608][ERROR][o.e.x.m.c.n.NodeStatsCollector] [es-data-1] collector [node_stats] timed out when collecting data
[2020-01-28T03:35:44,808][ERROR][o.e.x.m.c.n.NodeStatsCollector] [es-data-1] collector [node_stats] timed out when collecting data
[2020-01-28T03:36:57,346][WARN ][o.e.c.s.ClusterApplierService] [es-data-1] cluster state applier task [ApplyCommitRequest{term=15, version=49599, sourceNode={es-master-2}{DsrN9lejRcqINObszTz9Gg}{gqq6Qu8ARhqEd8MOVL5MhQ}{*********}{***.**.***.**:9300}{lm}{ml.machine_memory=33568231424, ml.max_open_jobs=20, xpack.installed=true}}] took [32.4s] which is above the warn threshold of [30s]: [running task [ApplyCommitRequest{term=15, version=49599, sourceNode={es-master-2}{DsrN9lejRcqINObszTz9Gg}{gqq6Qu8ARhqEd8MOVL5MhQ}{*********}{***.**.***.**:9300}{lm}{ml.machine_memory=33568231424, ml.max_open_jobs=20, xpack.installed=true}}]] took [0ms], [connecting to new nodes] took [0ms], [applying settings] took [0ms], [running applier [org.elasticsearch.indices.cluster.IndicesClusterStateService@e46e141]] took [0ms], [running applier [org.elasticsearch.script.ScriptService@1862b027]] took [0ms], [running applier [org.elasticsearch.xpack.ilm.IndexLifecycleService@463a0333]] took [0ms], [running applier [org.elasticsearch.repositories.RepositoriesService@7245e148]] took [0ms], [running applier [org.elasticsearch.snapshots.RestoreService@63f7280e]] took [0ms], [running applier [org.elasticsearch.ingest.IngestService@3e85f900]] took [0ms], [running applier [org.elasticsearch.action.ingest.IngestActionForwarder@44605575]] took [0ms], [running applier [org.elasticsearch.action.admin.cluster.repositories.cleanup.TransportCleanupRepositoryAction$$Lambda$3567/0x00000008015c2840@26fc6e01]] took [0ms], [running applier [org.elasticsearch.tasks.TaskManager@64b47aa3]] took [0ms], [running applier [org.elasticsearch.gateway.GatewayMetaState@4d51f35f]] took [11402ms], [notifying listener [org.elasticsearch.cluster.InternalClusterInfoService@7e30e4bb]] took [0ms], [notifying listener [org.elasticsearch.xpack.security.support.SecurityIndexManager@4cc32dd7]] took [0ms], [notifying listener [org.elasticsearch.xpack.security.support.SecurityIndexManager@289cdba9]] took [0ms], [notifying listener [org.elasticsearch.xpack.security.authc.TokenService$$Lambda$1972/0x000000080120c840@9108ad4]] took [0ms], [notifying listener [org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$$Lambda$2060/0x000000080124a840@70045992]] took [0ms], [notifying listener [org.elasticsearch.xpack.watcher.support.WatcherIndexTemplateRegistry@73a6d338]] took [0ms], [notifying listener [org.elasticsearch.xpack.watcher.WatcherLifeCycleService@3d291f76]] took [0ms], [notifying listener [org.elasticsearch.xpack.watcher.WatcherIndexingListener@52253a37]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager@71c5ef42]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.datafeed.DatafeedManager$TaskRunner@376acd9b]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.MlInitializationService@778b687a]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.MlAssignmentNotifier@1819dfd2]] took [0ms], [notifying listener [org.elasticsearch.xpack.ilm.IndexLifecycleService@463a0333]] took [0ms], [notifying listener [org.elasticsearch.xpack.core.slm.history.SnapshotLifecycleTemplateRegistry@62ba62ab]] took [0ms], [notifying listener [org.elasticsearch.xpack.slm.SnapshotLifecycleService@3dd6d72]] took [0ms], [notifying listener [org.elasticsearch.xpack.ccr.action.ShardFollowTaskCleaner@2fb90dcb]] took [0ms], [notifying listener [org.elasticsearch.cluster.metadata.TemplateUpgradeService@7908a0a1]] took [0ms], [notifying listener 
[org.elasticsearch.node.ResponseCollectorService@48963a11]] took [0ms], [notifying listener [org.elasticsearch.snapshots.SnapshotShardsService@2048788c]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.action.TransportOpenJobAction$OpenJobPersistentTasksExecutor$$Lambda$3120/0x00000008014c7840@4778fe1b]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.action.TransportStartDataFrameAnalyticsAction$TaskExecutor$$Lambda$3124/0x00000008014c8840@7b4943fd]] took [0ms], [notifying listener [org.elasticsearch.persistent.PersistentTasksClusterService@2f6a837]] took [0ms], [notifying listener [org.elasticsearch.cluster.routing.DelayedAllocationService@61193cfd]] took [0ms], [notifying listener [org.elasticsearch.indices.store.IndicesStore@61919ee3]] took [0ms], [notifying listener [org.elasticsearch.gateway.DanglingIndicesState@6776d2ae]] took [20918ms], [notifying listener [org.elasticsearch.persistent.PersistentTasksNodeService@41c2546d]] took [0ms], [notifying listener [org.elasticsearch.license.LicenseService@10f01d59]] took [0ms], [notifying listener [org.elasticsearch.xpack.ccr.action.AutoFollowCoordinator@d22edf0]] took [0ms], [notifying listener [org.elasticsearch.gateway.GatewayService@69a6262]] took [0ms], [notifying listener [org.elasticsearch.cluster.service.ClusterApplierService$LocalNodeMasterListeners@70e812c8]] took [0ms]
[2020-01-28T03:39:36,751][ERROR][o.e.x.m.c.n.NodeStatsCollector] [es-data-1] collector [node_stats] timed out when collecting data
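To see what the node is actually busy with when the collector times out, my plan is to capture hot threads on that data node during the next snapshot window, along these lines:
GET /_nodes/es-data-1/hot_threads?threads=5&interval=1s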
Cluster stats (GET /_cluster/stats):
{
  "_nodes" : {
    "total" : 10,
    "successful" : 10,
    "failed" : 0
  },
  "cluster_name" : "elastic",
  "cluster_uuid" : "*********************",
  "timestamp" : 1580228043195,
  "status" : "green",
  "indices" : {
    "count" : 29,
    "shards" : {
      "total" : 86,
      "primaries" : 43,
      "replication" : 1.0,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 6,
          "avg" : 2.9655172413793105
        },
        "primaries" : {
          "min" : 1,
          "max" : 3,
          "avg" : 1.4827586206896552
        },
        "replication" : {
          "min" : 1.0,
          "max" : 1.0,
          "avg" : 1.0
        }
      }
    },
    "docs" : {
      "count" : 104613938,
      "deleted" : 561491
    },
    "store" : {
      "size" : "471gb",
      "size_in_bytes" : 505834915759
    },
    "fielddata" : {
      "memory_size" : "350.9mb",
      "memory_size_in_bytes" : 367951240,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "79.9mb",
      "memory_size_in_bytes" : 83793974,
      "total_count" : 565215,
      "hit_count" : 159661,
      "miss_count" : 405554,
      "cache_size" : 4799,
      "cache_count" : 6093,
      "evictions" : 1294
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 1632,
      "memory" : "191.5mb",
      "memory_in_bytes" : 200901977,
      "terms_memory" : "59.3mb",
      "terms_memory_in_bytes" : 62263478,
      "stored_fields_memory" : "121.5mb",
      "stored_fields_memory_in_bytes" : 127473216,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "4.5kb",
      "norms_memory_in_bytes" : 4608,
      "points_memory" : "6.7mb",
      "points_memory_in_bytes" : 7059619,
      "doc_values_memory" : "3.9mb",
      "doc_values_memory_in_bytes" : 4101056,
      "index_writer_memory" : "87mb",
      "index_writer_memory_in_bytes" : 91306920,
      "version_map_memory" : "1.7mb",
      "version_map_memory_in_bytes" : 1786679,
      "fixed_bit_set" : "551.9kb",
      "fixed_bit_set_memory_in_bytes" : 565160,
      "max_unsafe_auto_id_timestamp" : 1580169603576,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 10,
      "coordinating_only" : 0,
      "data" : 3,
      "ingest" : 4,
      "master" : 3,
      "ml" : 10,
      "voting_only" : 0
    },
    "versions" : [
      "7.4.2"
    ],
    "os" : {
      "available_processors" : 80,
      "allocated_processors" : 80,
      "names" : [
        {
          "name" : "Linux",
          "count" : 10
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Red Hat Enterprise Linux Server 7.4 (Maipo)",
          "count" : 10
        }
      ],
      "mem" : {
        "total" : "312.6gb",
        "total_in_bytes" : 335682314240,
        "free" : "36.6gb",
        "free_in_bytes" : 39326842880,
        "used" : "276gb",
        "used_in_bytes" : 296355471360,
        "free_percent" : 12,
        "used_percent" : 88
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 298
      },
      "open_file_descriptors" : {
        "min" : 480,
        "max" : 885,
        "avg" : 602
      }
    },
    "jvm" : {
      "max_uptime" : "1.1d",
      "max_uptime_in_millis" : 96586237,
      "versions" : [
        {
          "version" : "13.0.1",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "13.0.1+9",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 10
        }
      ],
      "mem" : {
        "heap_used" : "9gb",
        "heap_used_in_bytes" : 9754097888,
        "heap_max" : "33.3gb",
        "heap_max_in_bytes" : 35809918976
      },
      "threads" : 658
    },
    "fs" : {
      "total" : "6tb",
      "total_in_bytes" : 6597069766656,
      "free" : "3.1tb",
      "free_in_bytes" : 3474412273664,
      "available" : "3.1tb",
      "available_in_bytes" : 3474412273664
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 10
      },
      "http_types" : {
        "security4" : 10
      }
    },
    "discovery_types" : {
      "zen" : 10
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "tar",
        "count" : 10
      }
    ]
  }
}
I am not sure about the root cause of these errors. Could increasing the shard count have caused the data nodes to time out?
Do you think increasing the heap size on the data nodes would resolve this issue?
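If the default 10s collection timeout is simply too tight for these nodes while a snapshot is running, I assume the monitoring collection timeout could also be raised as a workaround (assuming that setting is dynamic on 7.4), something like:
PUT _cluster/settings
{
  "transient" : {
    "xpack.monitoring.collection.node.stats.timeout" : "30s"
  }
}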
The odd thing is that it only happens around the time we take snapshots.
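To correlate the two, I have been checking whether a snapshot is still in flight when the errors show up (the repository name below is a placeholder):
GET /_snapshot/my_backup_repo/_current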
Any help is appreciated, thank you in advance.