Hi,
we have a 9-node cluster in production that had been running for 10 days without any issue, until a couple of days ago when we changed our index settings from 2 shards & 1 replica to 3 shards & 1 replica.
Just after the daily index rotation, I noticed a series of node_stats collector timeout errors on one of the data nodes.
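For reference, this is roughly how we applied the shard change for the daily indices; the template name and index pattern below are placeholders, not our real ones:
PUT _template/daily-logs
{
  "index_patterns" : ["logs-*"],
  "settings" : {
    "number_of_shards" : 3,
    "number_of_replicas" : 1
  }
}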
Data node log:
[2020-01-28T00:10:55,636][ERROR][o.e.x.m.c.n.NodeStatsCollector] [es-data-1] collector [node_stats] timed out when collecting data
[2020-01-28T00:12:57,323][ERROR][o.e.x.m.c.n.NodeStatsCollector] [es-data-1] collector [node_stats] timed out when collecting data
[2020-01-28T03:30:37,604][ERROR][o.e.x.m.c.n.NodeStatsCollector] [es-data-1] collector [node_stats] timed out when collecting data
[2020-01-28T03:33:07,607][ERROR][o.e.x.m.c.n.NodeStatsCollector] [es-data-1] collector [node_stats] timed out when collecting data
[2020-01-28T03:33:47,608][ERROR][o.e.x.m.c.n.NodeStatsCollector] [es-data-1] collector [node_stats] timed out when collecting data
[2020-01-28T03:35:44,808][ERROR][o.e.x.m.c.n.NodeStatsCollector] [es-data-1] collector [node_stats] timed out when collecting data
[2020-01-28T03:36:57,346][WARN ][o.e.c.s.ClusterApplierService] [es-data-1] cluster state applier task [ApplyCommitRequest{term=15, version=49599, sourceNode={es-master-2}{DsrN9lejRcqINObszTz9Gg}{gqq6Qu8ARhqEd8MOVL5MhQ}{*********}{***.**.***.**:9300}{lm}{ml.machine_memory=33568231424, ml.max_open_jobs=20, xpack.installed=true}}] took [32.4s] which is above the warn threshold of [30s]: [running task [ApplyCommitRequest{term=15, version=49599, sourceNode={es-master-2}{DsrN9lejRcqINObszTz9Gg}{gqq6Qu8ARhqEd8MOVL5MhQ}{*********}{***.**.***.**:9300}{lm}{ml.machine_memory=33568231424, ml.max_open_jobs=20, xpack.installed=true}}]] took [0ms], [connecting to new nodes] took [0ms], [applying settings] took [0ms], [running applier [org.elasticsearch.indices.cluster.IndicesClusterStateService@e46e141]] took [0ms], [running applier [org.elasticsearch.script.ScriptService@1862b027]] took [0ms], [running applier [org.elasticsearch.xpack.ilm.IndexLifecycleService@463a0333]] took [0ms], [running applier [org.elasticsearch.repositories.RepositoriesService@7245e148]] took [0ms], [running applier [org.elasticsearch.snapshots.RestoreService@63f7280e]] took [0ms], [running applier [org.elasticsearch.ingest.IngestService@3e85f900]] took [0ms], [running applier [org.elasticsearch.action.ingest.IngestActionForwarder@44605575]] took [0ms], [running applier [org.elasticsearch.action.admin.cluster.repositories.cleanup.TransportCleanupRepositoryAction$$Lambda$3567/0x00000008015c2840@26fc6e01]] took [0ms], [running applier [org.elasticsearch.tasks.TaskManager@64b47aa3]] took [0ms], [running applier [org.elasticsearch.gateway.GatewayMetaState@4d51f35f]] took [11402ms], [notifying listener [org.elasticsearch.cluster.InternalClusterInfoService@7e30e4bb]] took [0ms], [notifying listener [org.elasticsearch.xpack.security.support.SecurityIndexManager@4cc32dd7]] took [0ms], [notifying listener [org.elasticsearch.xpack.security.support.SecurityIndexManager@289cdba9]] took [0ms], [notifying listener [org.elasticsearch.xpack.security.authc.TokenService$$Lambda$1972/0x000000080120c840@9108ad4]] took [0ms], [notifying listener [org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$$Lambda$2060/0x000000080124a840@70045992]] took [0ms], [notifying listener [org.elasticsearch.xpack.watcher.support.WatcherIndexTemplateRegistry@73a6d338]] took [0ms], [notifying listener [org.elasticsearch.xpack.watcher.WatcherLifeCycleService@3d291f76]] took [0ms], [notifying listener [org.elasticsearch.xpack.watcher.WatcherIndexingListener@52253a37]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager@71c5ef42]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.datafeed.DatafeedManager$TaskRunner@376acd9b]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.MlInitializationService@778b687a]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.MlAssignmentNotifier@1819dfd2]] took [0ms], [notifying listener [org.elasticsearch.xpack.ilm.IndexLifecycleService@463a0333]] took [0ms], [notifying listener [org.elasticsearch.xpack.core.slm.history.SnapshotLifecycleTemplateRegistry@62ba62ab]] took [0ms], [notifying listener [org.elasticsearch.xpack.slm.SnapshotLifecycleService@3dd6d72]] took [0ms], [notifying listener [org.elasticsearch.xpack.ccr.action.ShardFollowTaskCleaner@2fb90dcb]] took [0ms], [notifying listener [org.elasticsearch.cluster.metadata.TemplateUpgradeService@7908a0a1]] took [0ms], [notifying listener 
[org.elasticsearch.node.ResponseCollectorService@48963a11]] took [0ms], [notifying listener [org.elasticsearch.snapshots.SnapshotShardsService@2048788c]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.action.TransportOpenJobAction$OpenJobPersistentTasksExecutor$$Lambda$3120/0x00000008014c7840@4778fe1b]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.action.TransportStartDataFrameAnalyticsAction$TaskExecutor$$Lambda$3124/0x00000008014c8840@7b4943fd]] took [0ms], [notifying listener [org.elasticsearch.persistent.PersistentTasksClusterService@2f6a837]] took [0ms], [notifying listener [org.elasticsearch.cluster.routing.DelayedAllocationService@61193cfd]] took [0ms], [notifying listener [org.elasticsearch.indices.store.IndicesStore@61919ee3]] took [0ms], [notifying listener [org.elasticsearch.gateway.DanglingIndicesState@6776d2ae]] took [20918ms], [notifying listener [org.elasticsearch.persistent.PersistentTasksNodeService@41c2546d]] took [0ms], [notifying listener [org.elasticsearch.license.LicenseService@10f01d59]] took [0ms], [notifying listener [org.elasticsearch.xpack.ccr.action.AutoFollowCoordinator@d22edf0]] took [0ms], [notifying listener [org.elasticsearch.gateway.GatewayService@69a6262]] took [0ms], [notifying listener [org.elasticsearch.cluster.service.ClusterApplierService$LocalNodeMasterListeners@70e812c8]] took [0ms]
[2020-01-28T03:39:36,751][ERROR][o.e.x.m.c.n.NodeStatsCollector] [es-data-1] collector [node_stats] timed out when collecting data
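To see what the node is actually busy with when the collector times out, my plan is to capture hot threads on that data node during the next snapshot window, along these lines:
GET /_nodes/es-data-1/hot_threads?threads=5&interval=1s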
Cluster stats (GET /_cluster/stats):
{
  "_nodes" : {
    "total" : 10,
    "successful" : 10,
    "failed" : 0
  },
  "cluster_name" : "elastic",
  "cluster_uuid" : "*********************",
  "timestamp" : 1580228043195,
  "status" : "green",
  "indices" : {
    "count" : 29,
    "shards" : {
      "total" : 86,
      "primaries" : 43,
      "replication" : 1.0,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 6,
          "avg" : 2.9655172413793105
        },
        "primaries" : {
          "min" : 1,
          "max" : 3,
          "avg" : 1.4827586206896552
        },
        "replication" : {
          "min" : 1.0,
          "max" : 1.0,
          "avg" : 1.0
        }
      }
    },
    "docs" : {
      "count" : 104613938,
      "deleted" : 561491
    },
    "store" : {
      "size" : "471gb",
      "size_in_bytes" : 505834915759
    },
    "fielddata" : {
      "memory_size" : "350.9mb",
      "memory_size_in_bytes" : 367951240,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "79.9mb",
      "memory_size_in_bytes" : 83793974,
      "total_count" : 565215,
      "hit_count" : 159661,
      "miss_count" : 405554,
      "cache_size" : 4799,
      "cache_count" : 6093,
      "evictions" : 1294
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 1632,
      "memory" : "191.5mb",
      "memory_in_bytes" : 200901977,
      "terms_memory" : "59.3mb",
      "terms_memory_in_bytes" : 62263478,
      "stored_fields_memory" : "121.5mb",
      "stored_fields_memory_in_bytes" : 127473216,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "4.5kb",
      "norms_memory_in_bytes" : 4608,
      "points_memory" : "6.7mb",
      "points_memory_in_bytes" : 7059619,
      "doc_values_memory" : "3.9mb",
      "doc_values_memory_in_bytes" : 4101056,
      "index_writer_memory" : "87mb",
      "index_writer_memory_in_bytes" : 91306920,
      "version_map_memory" : "1.7mb",
      "version_map_memory_in_bytes" : 1786679,
      "fixed_bit_set" : "551.9kb",
      "fixed_bit_set_memory_in_bytes" : 565160,
      "max_unsafe_auto_id_timestamp" : 1580169603576,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 10,
      "coordinating_only" : 0,
      "data" : 3,
      "ingest" : 4,
      "master" : 3,
      "ml" : 10,
      "voting_only" : 0
    },
    "versions" : [
      "7.4.2"
    ],
    "os" : {
      "available_processors" : 80,
      "allocated_processors" : 80,
      "names" : [
        {
          "name" : "Linux",
          "count" : 10
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Red Hat Enterprise Linux Server 7.4 (Maipo)",
          "count" : 10
        }
      ],
      "mem" : {
        "total" : "312.6gb",
        "total_in_bytes" : 335682314240,
        "free" : "36.6gb",
        "free_in_bytes" : 39326842880,
        "used" : "276gb",
        "used_in_bytes" : 296355471360,
        "free_percent" : 12,
        "used_percent" : 88
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 298
      },
      "open_file_descriptors" : {
        "min" : 480,
        "max" : 885,
        "avg" : 602
      }
    },
    "jvm" : {
      "max_uptime" : "1.1d",
      "max_uptime_in_millis" : 96586237,
      "versions" : [
        {
          "version" : "13.0.1",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "13.0.1+9",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 10
        }
      ],
      "mem" : {
        "heap_used" : "9gb",
        "heap_used_in_bytes" : 9754097888,
        "heap_max" : "33.3gb",
        "heap_max_in_bytes" : 35809918976
      },
      "threads" : 658
    },
    "fs" : {
      "total" : "6tb",
      "total_in_bytes" : 6597069766656,
      "free" : "3.1tb",
      "free_in_bytes" : 3474412273664,
      "available" : "3.1tb",
      "available_in_bytes" : 3474412273664
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 10
      },
      "http_types" : {
        "security4" : 10
      }
    },
    "discovery_types" : {
      "zen" : 10
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "tar",
        "count" : 10
      }
    ]
  }
}
I am not sure about the root cause of these errors. Could increasing the shard count have caused the data nodes to time out?
Do you think increasing the heap size on the data nodes would resolve this issue?
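If the default 10s collection timeout is simply too tight for these nodes while a snapshot is running, I assume the monitoring collection timeout could also be raised as a workaround (assuming that setting is dynamic on 7.4), something like:
PUT _cluster/settings
{
  "transient" : {
    "xpack.monitoring.collection.node.stats.timeout" : "30s"
  }
}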
The odd thing is that it only happens around the time we take snapshots.
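To correlate the two, I have been checking whether a snapshot is still in flight when the errors show up (the repository name below is a placeholder):
GET /_snapshot/my_backup_repo/_current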
Any help is appreciated, thank you in advance.