GC on Elasticsearch data nodes and node automatically disconnects and rejoins the cluster

Hi Team,

We are getting the below alerts on our Elasticsearch data nodes.

[2022-11-20T21:27:36,237][INFO ][o.e.m.j.JvmGcMonitorService] [esnode1] [gc][438806] overhead, spent [276ms] collecting in the last [1s]
[2022-11-20T21:34:55,808][INFO ][o.e.m.j.JvmGcMonitorService] [esnode1] [gc][439245] overhead, spent [274ms] collecting in the last [1s]
[2022-11-20T21:35:07,737][WARN ][o.e.m.j.JvmGcMonitorService] [esnode1] [gc][439254] overhead, spent [3.2s] collecting in the last [3.9s]
[2022-11-20T21:35:44,081][INFO ][o.e.c.c.Coordinator ] [esnode1] master node [{esnodem}{FgsefzjQSIeQEU6MiHstoA}{DWhI4XKsTzWsQK5eWT53ew}{ip}{ip:port}{m}{xpack.installed=true}] failed, restarting discovery

After the above log lines are printed in the log file, the same node gets disconnected from the cluster and rejoins after a few minutes.

JDK: jdk1.8.0_131
Elasticsearch version: 7.3.2
Heap: 11 GB

GC: CMS

-XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly
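
For context, the relevant jvm.options entries look roughly like this; the heap flags match the 11 GB heap mentioned above and the GC log path is an assumption, so treat this as a sketch rather than the exact file:

```
# jvm.options (relevant excerpt, approximate)

# heap size (11 GB, as mentioned above)
-Xms11g
-Xmx11g

# CMS collector settings
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:CMSInitiatingOccupancyFraction=70
-XX:+UseCMSInitiatingOccupancyOnly

# GC logging (log path is assumed)
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-Xloggc:/var/log/elasticsearch/gc.log
```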

What is the output from the _cluster/stats?pretty&human API?
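
For example, something like this against any node should return it (host, port, and credentials assumed to be the local defaults):

```
# fetch cluster-wide stats; add -u user:pass if security is enabled
curl -s -X GET "http://localhost:9200/_cluster/stats?pretty&human"
```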

Hi @warkolm,

Thanks for taking a look at this topic. I have attached the output of _cluster/stats?pretty&human below:


```
{
  "_nodes": {
    "total": 8,
    "successful": 8,
    "failed": 0
  },
  "cluster_name": "cluster-name",
  "cluster_uuid": "4XqywN51RUa3oD9PlmD6vA",
  "timestamp": 1669107605469,
  "status": "green",
  "indices": {
    "count": 455,
    "shards": {
      "total": 4526,
      "primaries": 2263,
      "replication": 1.0,
      "index": {
        "shards": {
          "min": 2,
          "max": 10,
          "avg": 9.947252747252747
        },
        "primaries": {
          "min": 1,
          "max": 5,
          "avg": 4.973626373626374
        },
        "replication": {
          "min": 1.0,
          "max": 1.0,
          "avg": 1.0
        }
      }
    },
    "docs": {
      "count": 8181393,
      "deleted": 212291
    },
    "store": {
      "size_in_bytes": 19787145460
    },
    "fielddata": {
      "memory_size_in_bytes": 0,
      "evictions": 0
    },
    "query_cache": {
      "memory_size_in_bytes": 0,
      "total_count": 443874,
      "hit_count": 90419,
      "miss_count": 353455,
      "cache_size": 0,
      "cache_count": 29592,
      "evictions": 29592
    },
    "completion": {
      "size_in_bytes": 0
    },
    "segments": {
      "count": 4736,
      "memory_in_bytes": 32822900,
      "terms_memory_in_bytes": 24083286,
      "stored_fields_memory_in_bytes": 5103888,
      "term_vectors_memory_in_bytes": 0,
      "norms_memory_in_bytes": 437888,
      "points_memory_in_bytes": 302760,
      "doc_values_memory_in_bytes": 2895078,
      "index_writer_memory_in_bytes": 0,
      "version_map_memory_in_bytes": 120,
      "fixed_bit_set_memory_in_bytes": 0,
      "max_unsafe_auto_id_timestamp": 1669018583631,
      "file_sizes": {
        
      }
    }
  },
  "nodes": {
    "count": {
      "total": 8,
      "coordinating_only": 0,
      "data": 5,
      "ingest": 0,
      "master": 3,
      "voting_only": 0
    },
    "versions": [
      "7.3.2"
    ],
    "os": {
      "available_processors": 32,
      "allocated_processors": 32,
      "names": [
        {
          "name": "Linux",
          "count": 8
        }
      ],
      "pretty_names": [
        {
          "pretty_name": "Ubuntu 18.04.6 LTS",
          "count": 8
        }
      ],
      "mem": {
        "total_in_bytes": 269358170112,
        "free_in_bytes": 76616896512,
        "used_in_bytes": 192741273600,
        "free_percent": 28,
        "used_percent": 72
      }
    },
    "process": {
      "cpu": {
        "percent": 12
      },
      "open_file_descriptors": {
        "min": 440,
        "max": 11238,
        "avg": 4853
      }
    },
    "jvm": {
      "max_uptime_in_millis": 2789136548,
      "versions": [
        {
          "version": "1.8.0_131",
          "vm_name": "Java HotSpot(TM) 64-Bit Server VM",
          "vm_version": "25.131-b11",
          "vm_vendor": "Oracle Corporation",
          "bundled_jdk": true,
          "using_bundled_jdk": false,
          "count": 8
        }
      ],
      "mem": {
        "heap_used_in_bytes": 27455664608,
        "heap_max_in_bytes": 72487534592
      },
      "threads": 560
    },
    "fs": {
      "total_in_bytes": 786984878080,
      "free_in_bytes": 756962906112,
      "available_in_bytes": 756879020032
    },
    "plugins": [
      {
        "name": "repository-gcs",
        "version": "7.3.2",
        "elasticsearch_version": "7.3.2",
        "java_version": "1.8",
        "description": "The GCS repository plugin adds Google Cloud Storage support for repositories.",
        "classname": "org.elasticsearch.repositories.gcs.GoogleCloudStoragePlugin",
        "extended_plugins": [
          
        ],
        "has_native_controller": false
      }
    ],
    "network_types": {
      "transport_types": {
        "security4": 8
      },
      "http_types": {
        "security4": 8
      }
    },
    "discovery_types": {
      "zen": 8
    },
    "packaging_types": [
      {
        "flavor": "default",
        "type": "tar",
        "count": 8
      }
    ]
  }
}
```

You have too many shards for the number of nodes and the amount of data you are running, and you are running a very, very old version of Elasticsearch.

You need to upgrade Elasticsearch urgently, and then look at reducing your shard count.
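
To put numbers on it: the stats above show 4,526 shards across 5 data nodes, roughly 900 shards per node on an 11 GB heap, far beyond the commonly cited guideline of at most around 20 shards per GB of heap. A quick way to see where the shards sit (host and credentials assumed):

```
# shards and disk usage per data node
curl -s "http://localhost:9200/_cat/allocation?v"

# primary/replica counts per index, largest shard counts first
curl -s "http://localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size&s=pri:desc"
```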

@warkolm
Thanks for the concern. We are planning to move to a newer ES version.

As of now, we have the same type of setup in some other clusters and we don't see this issue there.

Only some clusters are facing this issue, and GC logs are being printed continuously:

```
2022-12-04T20:02:44.962-0800: 20803142.235: Total time for which application threads were stopped: 0.0423853 seconds, Stopping threads took: 0.0001724 seconds
2022-12-04T20:03:40.002-0800: 20803197.274: [GC (Allocation Failure) 2022-12-04T20:03:40.002-0800: 20803197.275: [ParNew
Desired survivor size 134217728 bytes, new threshold 6 (max 6)
- age 1: 10510080 bytes, 10510080 total
- age 2: 3311576 bytes, 13821656 total
- age 3: 2423584 bytes, 16245240 total
- age 4: 1224248 bytes, 17469488 total
- age 5: 7008472 bytes, 24477960 total
- age 6: 6925464 bytes, 31403424 total
: 1613859K->42420K(1835008K), 0.0459758 secs] 4549630K->2979818K(6029312K), 0.0462850 secs] [Times: user=0.22 sys=0.03, real=0.04 secs]
2022-12-04T20:03:40.048-0800: 20803197.321: Total time for which application threads were stopped: 0.0493795 seconds, Stopping threads took: 0.0001277 seconds
2022-12-04T20:03:40.051-0800: 20803197.324: [GC (CMS Initial Mark) [1 CMS-initial-mark: 2937398K(4194304K)] 2982242K(6029312K), 0.0345291 secs] [Times: user=0.11 sys=0.02, real=0.03 secs]
2022-12-04T20:03:40.086-0800: 20803197.359: Total time for which application threads were stopped: 0.0375606 seconds, Stopping threads took: 0.0002171 seconds
2022-12-04T20:03:40.086-0800: 20803197.359: [CMS-concurrent-mark-start]
2022-12-04T20:03:42.023-0800: 20803199.296: [CMS-concurrent-mark: 1.937/1.937 secs] [Times: user=5.28 sys=0.86, real=1.94 secs]
2022-12-04T20:03:42.023-0800: 20803199.296: [CMS-concurrent-preclean-start]
2022-12-04T20:03:42.157-0800: 20803199.430: [CMS-concurrent-preclean: 0.133/0.134 secs] [Times: user=0.15 sys=0.02, real=0.13 secs]
2022-12-04T20:03:42.157-0800: 20803199.430: [CMS-concurrent-abortable-preclean-start]
CMS: abort preclean due to time 2022-12-04T20:03:47.408-0800: 20803204.681: [CMS-concurrent-abortable-preclean: 5.249/5.251 secs] [Times: user=5.24 sys=0.96, real=5.25 secs]
2022-12-04T20:03:47.412-0800: 20803204.684: [GC (CMS Final Remark) [YG occupancy: 297067 K (1835008 K)]2022-12-04T20:03:47.412-0800: 20803204.684: [Rescan (parallel) , 0.0308290 secs]2022-12-04T20:03:47.443-0800: 20803204.715: [weak refs processing, 2.2460156 secs]2022-12-04T20:03:49.689-0800: 20803206.961: [class unloading, 0.6939705 secs]2022-12-04T20:03:50.383-0800: 20803207.655: [scrub symbol table, 0.1351316 secs]2022-12-04T20:03:50.518-0800: 20803207.790: [scrub string table, 0.0019429 secs][1 CMS-remark: 2937398K(4194304K)] 3234465K(6029312K), 3.2012414 secs] [Times: user=2.97 sys=0.46, real=3.20 secs]
2022-12-04T20:03:50.613-0800: 20803207.886: Total time for which application threads were stopped: 3.2039811 seconds, Stopping threads took: 0.0001138 seconds
2022-12-04T20:03:50.613-0800: 20803207.886: [CMS-concurrent-sweep-start]
2022-12-04T20:03:56.189-0800: 20803213.462: [CMS-concurrent-sweep: 5.576/5.576 secs] [Times: user=10.89 sys=1.76, real=5.57 secs]
2022-12-04T20:03:56.189-0800: 20803213.462: [CMS-concurrent-reset-start]
2022-12-04T20:03:56.251-0800: 20803213.524: [CMS-concurrent-reset: 0.062/0.062 secs] [Times: user=0.11 sys=0.02, real=0.07 secs]
2022-12-04T20:04:07.619-0800: 20803224.892: Total time for which application threads were stopped: 0.0035519 seconds, Stopping threads took: 0.0002065 seconds
2022-12-04T20:04:12.282-0800: 20803229.555: [GC (Allocation Failure) 2022-12-04T20:04:12.282-0800: 20803229.555: [ParNew
Desired survivor size 134217728 bytes, new threshold 6 (max 6)
```
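
For reference, heap usage and GC counts per node can also be pulled from the nodes stats API while this is happening, for example (host assumed):

```
# per-node JVM heap usage plus young/old collection counts and times
curl -s "http://localhost:9200/_nodes/stats/jvm?pretty&human"
```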

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.