Data Node Error [Timed out when collecting data error]

Hi,

We have a 9-node cluster in production that had been running for 10 days without any issues, until a couple of days ago when we increased the shard count from 2 shards & 1 replica to 3 shards & 1 replica.

Just after the daily index rotation, I noticed a couple of timed-out errors on one of the data nodes.

Data node log:

[2020-01-28T00:10:55,636][ERROR][o.e.x.m.c.n.NodeStatsCollector] [es-data-1] collector [node_stats] timed out when collecting data
[2020-01-28T00:12:57,323][ERROR][o.e.x.m.c.n.NodeStatsCollector] [es-data-1] collector [node_stats] timed out when collecting data
[2020-01-28T03:30:37,604][ERROR][o.e.x.m.c.n.NodeStatsCollector] [es-data-1] collector [node_stats] timed out when collecting data
[2020-01-28T03:33:07,607][ERROR][o.e.x.m.c.n.NodeStatsCollector] [es-data-1] collector [node_stats] timed out when collecting data
[2020-01-28T03:33:47,608][ERROR][o.e.x.m.c.n.NodeStatsCollector] [es-data-1] collector [node_stats] timed out when collecting data
[2020-01-28T03:35:44,808][ERROR][o.e.x.m.c.n.NodeStatsCollector] [es-data-1] collector [node_stats] timed out when collecting data
[2020-01-28T03:36:57,346][WARN ][o.e.c.s.ClusterApplierService] [es-data-1] cluster state applier task [ApplyCommitRequest{term=15, version=49599, sourceNode={es-master-2}{DsrN9lejRcqINObszTz9Gg}{gqq6Qu8ARhqEd8MOVL5MhQ}{*********}{***.**.***.**:9300}{lm}{ml.machine_memory=33568231424, ml.max_open_jobs=20, xpack.installed=true}}] took [32.4s] which is above the warn threshold of [30s]: [running task [ApplyCommitRequest{term=15, version=49599, sourceNode={es-master-2}{DsrN9lejRcqINObszTz9Gg}{gqq6Qu8ARhqEd8MOVL5MhQ}{*********}{***.**.***.**:9300}{lm}{ml.machine_memory=33568231424, ml.max_open_jobs=20, xpack.installed=true}}]] took [0ms], [connecting to new nodes] took [0ms], [applying settings] took [0ms], [running applier [org.elasticsearch.indices.cluster.IndicesClusterStateService@e46e141]] took [0ms], [running applier [org.elasticsearch.script.ScriptService@1862b027]] took [0ms], [running applier [org.elasticsearch.xpack.ilm.IndexLifecycleService@463a0333]] took [0ms], [running applier [org.elasticsearch.repositories.RepositoriesService@7245e148]] took [0ms], [running applier [org.elasticsearch.snapshots.RestoreService@63f7280e]] took [0ms], [running applier [org.elasticsearch.ingest.IngestService@3e85f900]] took [0ms], [running applier [org.elasticsearch.action.ingest.IngestActionForwarder@44605575]] took [0ms], [running applier [org.elasticsearch.action.admin.cluster.repositories.cleanup.TransportCleanupRepositoryAction$$Lambda$3567/0x00000008015c2840@26fc6e01]] took [0ms], [running applier [org.elasticsearch.tasks.TaskManager@64b47aa3]] took [0ms], [running applier [org.elasticsearch.gateway.GatewayMetaState@4d51f35f]] took [11402ms], [notifying listener [org.elasticsearch.cluster.InternalClusterInfoService@7e30e4bb]] took [0ms], [notifying listener [org.elasticsearch.xpack.security.support.SecurityIndexManager@4cc32dd7]] took [0ms], [notifying listener [org.elasticsearch.xpack.security.support.SecurityIndexManager@289cdba9]] took [0ms], [notifying listener 
[org.elasticsearch.xpack.security.authc.TokenService$$Lambda$1972/0x000000080120c840@9108ad4]] took [0ms], [notifying listener [org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$$Lambda$2060/0x000000080124a840@70045992]] took [0ms], [notifying listener [org.elasticsearch.xpack.watcher.support.WatcherIndexTemplateRegistry@73a6d338]] took [0ms], [notifying listener [org.elasticsearch.xpack.watcher.WatcherLifeCycleService@3d291f76]] took [0ms], [notifying listener [org.elasticsearch.xpack.watcher.WatcherIndexingListener@52253a37]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager@71c5ef42]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.datafeed.DatafeedManager$TaskRunner@376acd9b]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.MlInitializationService@778b687a]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.MlAssignmentNotifier@1819dfd2]] took [0ms], [notifying listener [org.elasticsearch.xpack.ilm.IndexLifecycleService@463a0333]] took [0ms], [notifying listener [org.elasticsearch.xpack.core.slm.history.SnapshotLifecycleTemplateRegistry@62ba62ab]] took [0ms], [notifying listener [org.elasticsearch.xpack.slm.SnapshotLifecycleService@3dd6d72]] took [0ms], [notifying listener [org.elasticsearch.xpack.ccr.action.ShardFollowTaskCleaner@2fb90dcb]] took [0ms], [notifying listener [org.elasticsearch.cluster.metadata.TemplateUpgradeService@7908a0a1]] took [0ms], [notifying listener [org.elasticsearch.node.ResponseCollectorService@48963a11]] took [0ms], [notifying listener [org.elasticsearch.snapshots.SnapshotShardsService@2048788c]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.action.TransportOpenJobAction$OpenJobPersistentTasksExecutor$$Lambda$3120/0x00000008014c7840@4778fe1b]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.action.TransportStartDataFrameAnalyticsAction$TaskExecutor$$Lambda$3124/0x00000008014c8840@7b4943fd]] 
took [0ms], [notifying listener [org.elasticsearch.persistent.PersistentTasksClusterService@2f6a837]] took [0ms], [notifying listener [org.elasticsearch.cluster.routing.DelayedAllocationService@61193cfd]] took [0ms], [notifying listener [org.elasticsearch.indices.store.IndicesStore@61919ee3]] took [0ms], [notifying listener [org.elasticsearch.gateway.DanglingIndicesState@6776d2ae]] took [20918ms], [notifying listener [org.elasticsearch.persistent.PersistentTasksNodeService@41c2546d]] took [0ms], [notifying listener [org.elasticsearch.license.LicenseService@10f01d59]] took [0ms], [notifying listener [org.elasticsearch.xpack.ccr.action.AutoFollowCoordinator@d22edf0]] took [0ms], [notifying listener [org.elasticsearch.gateway.GatewayService@69a6262]] took [0ms], [notifying listener [org.elasticsearch.cluster.service.ClusterApplierService$LocalNodeMasterListeners@70e812c8]] took [0ms]
[2020-01-28T03:39:36,751][ERROR][o.e.x.m.c.n.NodeStatsCollector] [es-data-1] collector [node_stats] timed out when collecting data

GET /_cluster/stats

{
  "_nodes" : {
    "total" : 10,
    "successful" : 10,
    "failed" : 0
  },
  "cluster_name" : "elastic",
  "cluster_uuid" : "*********************",
  "timestamp" : 1580228043195,
  "status" : "green",
  "indices" : {
    "count" : 29,
    "shards" : {
      "total" : 86,
      "primaries" : 43,
      "replication" : 1.0,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 6,
          "avg" : 2.9655172413793105
        },
        "primaries" : {
          "min" : 1,
          "max" : 3,
          "avg" : 1.4827586206896552
        },
        "replication" : {
          "min" : 1.0,
          "max" : 1.0,
          "avg" : 1.0
        }
      }
    },
    "docs" : {
      "count" : 104613938,
      "deleted" : 561491
    },
    "store" : {
      "size" : "471gb",
      "size_in_bytes" : 505834915759
    },
    "fielddata" : {
      "memory_size" : "350.9mb",
      "memory_size_in_bytes" : 367951240,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "79.9mb",
      "memory_size_in_bytes" : 83793974,
      "total_count" : 565215,
      "hit_count" : 159661,
      "miss_count" : 405554,
      "cache_size" : 4799,
      "cache_count" : 6093,
      "evictions" : 1294
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 1632,
      "memory" : "191.5mb",
      "memory_in_bytes" : 200901977,
      "terms_memory" : "59.3mb",
      "terms_memory_in_bytes" : 62263478,
      "stored_fields_memory" : "121.5mb",
      "stored_fields_memory_in_bytes" : 127473216,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "4.5kb",
      "norms_memory_in_bytes" : 4608,
      "points_memory" : "6.7mb",
      "points_memory_in_bytes" : 7059619,
      "doc_values_memory" : "3.9mb",
      "doc_values_memory_in_bytes" : 4101056,
      "index_writer_memory" : "87mb",
      "index_writer_memory_in_bytes" : 91306920,
      "version_map_memory" : "1.7mb",
      "version_map_memory_in_bytes" : 1786679,
      "fixed_bit_set" : "551.9kb",
      "fixed_bit_set_memory_in_bytes" : 565160,
      "max_unsafe_auto_id_timestamp" : 1580169603576,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 10,
      "coordinating_only" : 0,
      "data" : 3,
      "ingest" : 4,
      "master" : 3,
      "ml" : 10,
      "voting_only" : 0
    },
    "versions" : [
      "7.4.2"
    ],
    "os" : {
      "available_processors" : 80,
      "allocated_processors" : 80,
      "names" : [
        {
          "name" : "Linux",
          "count" : 10
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Red Hat Enterprise Linux Server 7.4 (Maipo)",
          "count" : 10
        }
      ],
      "mem" : {
        "total" : "312.6gb",
        "total_in_bytes" : 335682314240,
        "free" : "36.6gb",
        "free_in_bytes" : 39326842880,
        "used" : "276gb",
        "used_in_bytes" : 296355471360,
        "free_percent" : 12,
        "used_percent" : 88
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 298
      },
      "open_file_descriptors" : {
        "min" : 480,
        "max" : 885,
        "avg" : 602
      }
    },
    "jvm" : {
      "max_uptime" : "1.1d",
      "max_uptime_in_millis" : 96586237,
      "versions" : [
        {
          "version" : "13.0.1",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "13.0.1+9",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 10
        }
      ],
      "mem" : {
        "heap_used" : "9gb",
        "heap_used_in_bytes" : 9754097888,
        "heap_max" : "33.3gb",
        "heap_max_in_bytes" : 35809918976
      },
      "threads" : 658
    },
    "fs" : {
      "total" : "6tb",
      "total_in_bytes" : 6597069766656,
      "free" : "3.1tb",
      "free_in_bytes" : 3474412273664,
      "available" : "3.1tb",
      "available_in_bytes" : 3474412273664
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 10
      },
      "http_types" : {
        "security4" : 10
      }
    },
    "discovery_types" : {
      "zen" : 10
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "tar",
        "count" : 10
      }
    ]
  }
}

I am not sure about the reason behind this error. Could it be that adding a shard caused the data nodes to time out?
Do you think that increasing the data nodes' heap size would resolve this issue?
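In the meantime, I found that the node_stats collector timeout is a dynamic cluster setting (default 10s, if I read the docs right), so I assume I could raise it as a workaround to quiet the errors, even though that wouldn't fix whatever is making the stats calls slow:

```
PUT /_cluster/settings
{
  "transient": {
    "xpack.monitoring.collection.node.stats.timeout": "30s"
  }
}
```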

The thing is, it only happens around the time when we take snapshots.
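If it helps with diagnosis, I can capture hot threads on the affected node while a snapshot is running, to see what it is busy with when the collector times out:

```
GET /_nodes/es-data-1/hot_threads?threads=5
```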

Any help is appreciated, thank you in advance.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.