Elasticsearch nodes doing young generation GC very frequently


(Nitish Goyal) #1

Cluster details :
Elasticsearch version : 6.3.0
Java version : 1.8.0_191
54 data nodes
Each bare-metal (BM) host is split into 2 VMs. Each VM has 128 GB RAM, a 31 GB heap, and 18 cores
3 master nodes

JVM options

-Xms31744m
-Xmx31744m
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:MaxNewSize=16168m 

The frequent garbage collection is badly impacting the performance of the cluster.

I tried different memory settings for the young generation, ranging from 1 GB to 16 GB of heap.
With all of these settings, I see garbage collection being triggered every second.

[2019-01-10T11:02:25,733][WARN ][o.e.m.j.JvmGcMonitorService] [####] [gc][young][53][6] duration [1s], collections [1]/[1.6s], total [1s]/[2.4s], memory [16.1gb]->[6gb]/[29.4gb], all_pools {[young] [11.6gb]->[617.2mb]/[12.6gb]}{[survivor] [618.9mb]->[1.5gb]/[1.5gb]}{[old] [3.8gb]->[3.8gb]/[15.2gb]}
[2019-01-10T11:02:25,735][WARN ][o.e.m.j.JvmGcMonitorService] [####] [gc][53] overhead, spent [1s] collecting in the last [1.6s]
[2019-01-10T11:02:33,981][WARN ][o.e.m.j.JvmGcMonitorService] [####] [gc][young][59][8] duration [2.6s], collections [2]/[3.2s], total [2.6s]/[5s], memory [16.5gb]->[7.3gb]/[29.4gb], all_pools {[young] [11gb]->[147.3mb]/[12.6gb]}{[survivor] [1.5gb]->[406.2mb]/[1.5gb]}{[old] [3.8gb]->[6.8gb]/[15.2gb]}
[2019-01-10T11:02:33,997][WARN ][o.e.m.j.JvmGcMonitorService] [####] [gc][59] overhead, spent [2.6s] collecting in the last [3.2s]
[2019-01-10T11:02:46,927][INFO ][o.e.m.j.JvmGcMonitorService] [####] [gc][young][71][10] duration [906ms], collections [1]/[1.8s], total [906ms]/[6.1s], memory [16.7gb]->[9.8gb]/[29.4gb], all_pools {[young] [8.4gb]->[103.3mb]/[12.6gb]}{[survivor] [1.3gb]->[1.5gb]/[1.5gb]}{[old] [6.8gb]->[8.1gb]/[15.2gb]}
[2019-01-10T11:02:46,930][INFO ][o.e.m.j.JvmGcMonitorService] [####] [gc][71] overhead, spent [906ms] collecting in the last [1.8s]
[2019-01-10T11:02:58,339][WARN ][o.e.m.j.JvmGcMonitorService] [####] [gc][young][82][11] duration [1.2s], collections [1]/[1.4s], total [1.2s]/[7.4s], memory [21.9gb]->[11gb]/[29.4gb], all_pools {[young] [12.1gb]->[126mb]/[12.6gb]}{[survivor] [1.5gb]->[1.4gb]/[1.5gb]}{[old] [8.1gb]->[9.4gb]/[15.2gb]}
[2019-01-10T11:02:58,341][WARN ][o.e.m.j.JvmGcMonitorService] [####] [gc][82] overhead, spent [1.2s] collecting in the last [1.4s]
[2019-01-10T11:03:13,347][INFO ][o.e.m.j.JvmGcMonitorService] [####] [gc][97] overhead, spent [259ms] collecting in the last [1s]
[2019-01-10T11:03:24,163][WARN ][o.e.m.j.JvmGcMonitorService] [####] [gc][young][107][13] duration [1.5s], collections [1]/[1.8s], total [1.5s]/[9.1s], memory [22.6gb]->[10.9gb]/[29.4gb], all_pools {[young] [12gb]->[81.5mb]/[12.6gb]}{[survivor] [1.1gb]->[915.9mb]/[1.5gb]}{[old] [9.4gb]->[9.9gb]/[15.2gb]}
[2019-01-10T11:03:24,164][WARN ][o.e.m.j.JvmGcMonitorService] [####] [gc][107] overhead, spent [1.5s] collecting in the last [1.8s]
[2019-01-10T11:03:31,384][INFO ][o.e.m.j.JvmGcMonitorService] [####] [gc][114] overhead, spent [399ms] collecting in the last [1.2s]
[2019-01-10T11:04:27,553][WARN ][o.e.m.j.JvmGcMonitorService] [####] [gc][170] overhead, spent [657ms] collecting in the last [1s]
[2019-01-10T11:04:42,564][INFO ][o.e.m.j.JvmGcMonitorService] [####] [gc][185] overhead, spent [273ms] collecting in the last [1s]
[2019-01-10T11:04:50,847][WARN ][o.e.m.j.JvmGcMonitorService] [####] [gc][young][193][19] duration [1s], collections [1]/[1.2s], total [1s]/[11.8s], memory [23gb]->[10.7gb]/[29.4gb], all_pools {[young] [12.5gb]->[248.6mb]/[12.6gb]}{[survivor] [418.2mb]->[468.9mb]/[1.5gb]}{[old] [10gb]->[10gb]/[15.2gb]}
[2019-01-10T11:04:50,851][WARN ][o.e.m.j.JvmGcMonitorService] [####] [gc][193] overhead, spent [1s] collecting in the last [1.2s]
[2019-01-10T11:05:15,877][INFO ][o.e.m.j.JvmGcMonitorService] [####] [gc][218] overhead, spent [322ms] collecting in the last [1s]
[2019-01-10T11:05:44,959][INFO ][o.e.m.j.JvmGcMonitorService] [####] [gc][young][247][22] duration [957ms], collections [1]/[1s], total [957ms]/[13.1s], memory [22.9gb]->[10.6gb]/[29.4gb], all_pools {[young] [12.4gb]->[225.1mb]/[12.6gb]}{[survivor] [437.5mb]->[379.6mb]/[1.5gb]}{[old] [10gb]->[10gb]/[15.2gb]}

Kindly suggest what needs to be fixed for better performance.


(Nitish Goyal) #2

@elastic Kindly suggest. Our cluster performance has degraded and lag is building up in our ingestion pipelines.

@davidkarlsen @Badger I see you have faced similar issues in the past. It would be really helpful if you could help us out here.

Could this be an issue because of 2 VMs on 1 BM?

Thanks,
Nitish


(Christian Dahlqvist) #3

What is the full output of the cluster stats API?
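For readers who want to collect the same information, here is a minimal sketch of fetching the cluster stats over HTTP. The endpoint `GET /_cluster/stats` and the `human` query parameter are part of the Elasticsearch REST API; the helper names, the base URL, and the absence of authentication are assumptions for illustration (a secured cluster would need credentials and probably TLS settings added).

```python
import json
import urllib.request


def cluster_stats_url(base_url, human=True):
    """Build the cluster stats URL; `human` adds human-readable size fields."""
    base = base_url.rstrip("/")
    query = "?human=true" if human else ""
    return f"{base}/_cluster/stats{query}"


def fetch_cluster_stats(base_url, timeout=10):
    """Fetch and decode the cluster stats (hypothetical helper, no auth)."""
    url = cluster_stats_url(base_url)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # "http://localhost:9200" is a placeholder address, not taken from this thread.
    print(cluster_stats_url("http://localhost:9200"))
```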


#4

Overall you appear to spend about 11 seconds in GC over three and a half minutes. It's high, but not outrageous. I doubt GC is the centre of your problems.
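The estimate above can be reproduced from the posted log. The sketch below assumes that the second value in `total [..]/[..]` on each `[gc][young]` line is the cumulative young-collector time since the node started, and compares the first and last such lines. The two sample lines are copied from the log in the first post and truncated after the `memory` field; the function names are illustrative only.

```python
import re
from datetime import datetime

# First and last [gc][young] lines from the posted log (truncated).
SAMPLE = """\
[2019-01-10T11:02:25,733][WARN ][o.e.m.j.JvmGcMonitorService] [####] [gc][young][53][6] duration [1s], collections [1]/[1.6s], total [1s]/[2.4s], memory [16.1gb]->[6gb]/[29.4gb]
[2019-01-10T11:05:44,959][INFO ][o.e.m.j.JvmGcMonitorService] [####] [gc][young][247][22] duration [957ms], collections [1]/[1s], total [957ms]/[13.1s], memory [22.9gb]->[10.6gb]/[29.4gb]
"""

# Captures the timestamp and the cumulative time (second value after "total").
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\].*\[gc\]\[young\].*total \[[^\]]+\]/\[(?P<cum>[^\]]+)\]"
)


def to_seconds(text):
    """Convert a duration such as '957ms', '13.1s' or '1.2m' to seconds."""
    if text.endswith("ms"):
        return float(text[:-2]) / 1000.0
    if text.endswith("s"):
        return float(text[:-1])
    if text.endswith("m"):
        return float(text[:-1]) * 60.0
    raise ValueError(f"unrecognised duration: {text!r}")


def gc_overhead(log_text):
    """Return (gc_seconds, wall_seconds, fraction) between first and last young-GC lines."""
    samples = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S,%f")
            samples.append((stamp, to_seconds(match.group("cum"))))
    if len(samples) < 2:
        raise ValueError("need at least two [gc][young] lines")
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    wall = (t1 - t0).total_seconds()
    gc_time = c1 - c0
    return gc_time, wall, gc_time / wall


if __name__ == "__main__":
    gc_time, wall, fraction = gc_overhead(SAMPLE)
    print(f"{gc_time:.1f}s of young GC in {wall:.0f}s wall time ({fraction:.1%})")
```

Using only the first and last lines ignores collections that were too short to be logged individually, so this is a rough figure rather than an exact accounting.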

Do you have a rationale for using -XX:CMSInitiatingOccupancyFraction=75? If not, then remove it. It just means you can only use 3/4 of your heap.


(Mark Walkom) #5

Please don't ping people like that. Most users here are community based volunteers.


(Nitish Goyal) #6

Output of cluster stats API


{
  "_nodes": {
    "total": 62,
    "successful": 62,
    "failed": 0
  },
  "cluster_name": "###",
  "timestamp": 1547181459080,
  "status": "green",
  "indices": {
    "count": 2647,
    "shards": {
      "total": 14178,
      "primaries": 7095,
      "replication": 0.99830866807611,
      "index": {
        "shards": {
          "min": 2,
          "max": 40,
          "avg": 5.356252361163581
        },
        "primaries": {
          "min": 1,
          "max": 20,
          "avg": 2.680392897619947
        },
        "replication": {
          "min": 0,
          "max": 1,
          "avg": 0.9977332829618436
        }
      }
    },
    "docs": {
      "count": 45373165921,
      "deleted": 274292654
    },
    "store": {
      "size_in_bytes": 115342384533402
    },
    "fielddata": {
      "memory_size_in_bytes": 29848384,
      "evictions": 0
    },
    "query_cache": {
      "memory_size_in_bytes": 0,
      "total_count": 0,
      "hit_count": 0,
      "miss_count": 0,
      "cache_size": 0,
      "cache_count": 0,
      "evictions": 0
    },
    "completion": {
      "size_in_bytes": 0
    },
    "segments": {
      "count": 279800,
      "memory_in_bytes": 344827997472,
      "terms_memory_in_bytes": 324454122049,
      "stored_fields_memory_in_bytes": 10918551272,
      "term_vectors_memory_in_bytes": 0,
      "norms_memory_in_bytes": 2593041856,
      "points_memory_in_bytes": 5337342615,
      "doc_values_memory_in_bytes": 1524939680,
      "index_writer_memory_in_bytes": 4602154631,
      "version_map_memory_in_bytes": 14059924,
      "fixed_bit_set_memory_in_bytes": 0,
      "max_unsafe_auto_id_timestamp": -1,
      "file_sizes": {}
    }
  },
  "nodes": {
    "count": {
      "total": 62,
      "data": 54,
      "coordinating_only": 0,
      "master": 3,
      "ingest": 62
    },
    "versions": [
      "6.3.0"
    ],
    "os": {
      "available_processors": 1144,
      "allocated_processors": 1144,
      "names": [
        {
          "name": "Linux",
          "count": 62
        }
      ],
      "mem": {
        "total_in_bytes": 5514092630016,
        "free_in_bytes": 609798819840,
        "used_in_bytes": 4904293810176,
        "free_percent": 11,
        "used_percent": 89
      }
    },
    "process": {
      "cpu": {
        "percent": 363
      },
      "open_file_descriptors": {
        "min": 1635,
        "max": 4105,
        "avg": 3102
      }
    },
    "jvm": {
      "max_uptime_in_millis": 11555411240,
      "versions": [
        {
          "version": "1.8.0_181",
          "vm_name": "Java HotSpot(TM) 64-Bit Server VM",
          "vm_version": "25.181-b13",
          "vm_vendor": "Oracle Corporation",
          "count": 8
        },
        {
          "version": "1.8.0_191",
          "vm_name": "Java HotSpot(TM) 64-Bit Server VM",
          "vm_version": "25.191-b12",
          "vm_vendor": "Oracle Corporation",
          "count": 54
        }
      ],
      "mem": {
        "heap_used_in_bytes": 744203944496,
        "heap_max_in_bytes": 1928486518784
      },
      "threads": 11634
    },
    "fs": {
      "total_in_bytes": 345595116142592,
      "free_in_bytes": 230171333484544,
      "available_in_bytes": 230171333484544
    },
    "plugins": [],
    "network_types": {
      "transport_types": {
        "security4": 62
      },
      "http_types": {
        "security4": 62
      }
    }
  }
}
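A few per-node figures can be derived from these totals. The sketch below is only arithmetic on the values posted above; it assumes, purely for illustration, that shards and segment memory are spread evenly across the 54 data nodes, which the output itself does not confirm. The dictionary keys and function name are made up for this example.

```python
GIB = 2 ** 30

# Values copied from the cluster stats output above.
STATS = {
    "data_nodes": 54,
    "shards_total": 14178,
    "segments_memory_in_bytes": 344827997472,
    "terms_memory_in_bytes": 324454122049,
    "heap_used_in_bytes": 744203944496,
    "heap_max_in_bytes": 1928486518784,
}


def derive(stats):
    """Return rough per-data-node averages, assuming an even distribution."""
    nodes = stats["data_nodes"]
    seg_mem = stats["segments_memory_in_bytes"]
    return {
        # Average number of shards hosted by each data node.
        "shards_per_data_node": stats["shards_total"] / nodes,
        # Average segment memory per data node, in GiB.
        "segment_memory_gib_per_data_node": seg_mem / nodes / GIB,
        # Share of segment memory taken by the terms structures.
        "terms_share_of_segment_memory": stats["terms_memory_in_bytes"] / seg_mem,
        # Cluster-wide heap in use as a fraction of maximum heap.
        "heap_used_fraction": stats["heap_used_in_bytes"] / stats["heap_max_in_bytes"],
    }


if __name__ == "__main__":
    for name, value in derive(STATS).items():
        print(f"{name}: {value:.3f}")
```

Under that even-distribution assumption, each data node would hold roughly 260 shards and about 6 GiB of segment memory, most of it in terms structures.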


(Nitish Goyal) #8

@elastic


(system) closed #9

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.