Elastic nodes go down as GC kicks in

ES Version: 5.4.1
OS: Ubuntu
Java: 1.8

Issue

We recently upgraded our systems from ES 2.3 to ES 5.4.1, and everything was going well until we deployed it to production. ES nodes then started going down randomly for short periods. On investigation, we saw the following GC entries in the ES logs:

[2017-12-15T20:33:28,459][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][1723] overhead, spent [252ms] collecting in the last [1s]
[2017-12-15T20:42:53,599][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][2288] overhead, spent [288ms] collecting in the last [1s]
[2017-12-15T20:53:55,846][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][2950] overhead, spent [313ms] collecting in the last [1s]
[2017-12-15T21:03:18,145][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][3512] overhead, spent [325ms] collecting in the last [1s]
[2017-12-15T21:45:45,815][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][6059] overhead, spent [340ms] collecting in the last [1s]
[2017-12-15T21:46:39,834][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][6113] overhead, spent [445ms] collecting in the last [1s]
[2017-12-15T21:46:59,895][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][6133] overhead, spent [281ms] collecting in the last [1s]
[2017-12-15T21:47:27,901][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][6161] overhead, spent [306ms] collecting in the last [1s]
[2017-12-15T21:51:56,093][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][6429] overhead, spent [297ms] collecting in the last [1.1s]
[2017-12-15T21:51:57,094][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][6430] overhead, spent [353ms] collecting in the last [1s]
[2017-12-15T21:57:56,361][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][6789] overhead, spent [398ms] collecting in the last [1.1s]
[2017-12-15T21:58:23,407][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][6816] overhead, spent [284ms] collecting in the last [1s]
[2017-12-15T21:58:24,444][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][6817] overhead, spent [341ms] collecting in the last [1s]
[2017-12-15T22:01:29,536][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][7002] overhead, spent [341ms] collecting in the last [1s]
[2017-12-15T22:01:31,663][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][7004] overhead, spent [399ms] collecting in the last [1.1s]
[2017-12-15T22:01:35,664][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][7008] overhead, spent [273ms] collecting in the last [1s]
[2017-12-15T22:03:08,779][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][7101] overhead, spent [281ms] collecting in the last [1s]
[2017-12-15T22:05:54,928][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][7267] overhead, spent [310ms] collecting in the last [1.1s]
[2017-12-15T22:05:55,928][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][7268] overhead, spent [305ms] collecting in the last [1s]
[2017-12-15T22:05:58,929][INFO ][o.e.m.j.JvmGcMonitorService] [vzvV7bX] [gc][7271] overhead, spent [287ms] collecting in the last [1s]

Earlier, ES 2.3 used to run fine with an 8 GB heap on a system with 32 GB of RAM. We increased the heap from 8 GB to 12 GB, then increased the system RAM to 64 GB and gave Elasticsearch a 32 GB heap, but the ES nodes keep going down whenever a heavy query runs. Please note that before the upgrade everything was working fine.
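
For what it's worth, on 5.x the heap is configured in jvm.options (ES_HEAP_SIZE is no longer honored), and the usual guidance is to keep -Xms and -Xmx equal, at or below roughly half the system RAM, and under the ~32 GB compressed-oops threshold. A minimal sketch, assuming the default path of a .deb install:

    # /etc/elasticsearch/jvm.options (path assumed for a .deb install)
    # keep min and max equal to avoid heap resizing pauses
    -Xms8g
    -Xmx8g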

With the 12 GB heap, the JVM stats were as follows:

  "jvm" : {
        "timestamp" : 1513422077508,
        "uptime_in_millis" : 16750330,
        "mem" : {
          "heap_used_in_bytes" : 9402202880,
          "heap_used_percent" : 73,
          "heap_committed_in_bytes" : 12815171584,
          "heap_max_in_bytes" : 12815171584,
          "non_heap_used_in_bytes" : 97693480,
          "non_heap_committed_in_bytes" : 103424000,
          "pools" : {
            "young" : {
              "used_in_bytes" : 441150480,
              "max_in_bytes" : 558432256,
              "peak_used_in_bytes" : 558432256,
              "peak_max_in_bytes" : 558432256
            },
            "survivor" : {
              "used_in_bytes" : 42550952,
              "max_in_bytes" : 69730304,
              "peak_used_in_bytes" : 69730304,
              "peak_max_in_bytes" : 69730304
            },
            "old" : {
              "used_in_bytes" : 8918501448,
              "max_in_bytes" : 12187009024,
              "peak_used_in_bytes" : 8918501448,
              "peak_max_in_bytes" : 12187009024
            }
          }
        },
        "threads" : {
          "count" : 71,
          "peak_count" : 95
        },
        "gc" : {
          "collectors" : {
            "young" : {
              "collection_count" : 104,
              "collection_time_in_millis" : 7078
            },
            "old" : {
              "collection_count" : 1,
              "collection_time_in_millis" : 170
            }
          }
        },
        "buffer_pools" : {
          "direct" : {
            "count" : 67,
            "used_in_bytes" : 847203522,
            "total_capacity_in_bytes" : 847203521
          },
          "mapped" : {
            "count" : 1204,
            "used_in_bytes" : 16787345787,
            "total_capacity_in_bytes" : 16787345787
          }
        },
        "classes" : {
          "current_loaded_count" : 11025,
          "total_loaded_count" : 11025,
          "total_unloaded_count" : 0
        }
      }
    }

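(The block above is the "jvm" section of the nodes stats API; assuming a node listening on localhost:9200, it can be pulled with the following.)

    curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty'
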
So far we have not been able to find any concrete solution to the problem.

I created an issue here: https://github.com/elastic/elasticsearch/issues/27872, but it was closed.

You could use supervisorctl for this; it can auto-start the node when it goes down.
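
If you go that route, a minimal supervisord sketch could look like the following (program name, binary path and user are assumptions; adjust for your install). Note that this only restarts the process, it does not address the underlying GC problem.

    [program:elasticsearch]
    ; binary path and user assumed, adjust for your install
    command=/usr/share/elasticsearch/bin/elasticsearch
    user=elasticsearch
    autostart=true
    autorestart=true
    stopasgroup=true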

It recovers automatically within a second, but the ES client reports the node as down for that second if any other query is fired in the meantime.

I can see that it's because of the huge heap size. The GC time is too high.

Since Elasticsearch now uses doc values for all aggregations except those on analyzed strings, you can try reducing the heap size (check whether you aggregate on analyzed strings before doing this).
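
As a concrete illustration (the index, type and field names below are just examples), aggregations should target a keyword sub-field backed by doc values rather than the analyzed text field; you can also check what is currently held as fielddata with GET _cat/fielddata?v.

    curl -XPUT 'http://localhost:9200/my_index?pretty' -H 'Content-Type: application/json' -d'
    {
      "mappings": {
        "my_type": {
          "properties": {
            "city": {
              "type": "text",
              "fields": {
                "keyword": { "type": "keyword" }
              }
            }
          }
        }
      }
    }'

Aggregating on "city.keyword" then uses on-disk doc values instead of building fielddata on the heap.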

Thanks for the suggestion. I reduced the Java -Xms to 8 GB, but the nodes are still going down. The stats now are:

 "jvm" : {
          "timestamp" : 1513839849384,
          "uptime_in_millis" : 12440740,
          "mem" : {
            "heap_used_in_bytes" : 2964483576,
            "heap_used_percent" : 34,
            "heap_committed_in_bytes" : 6240751616,
            "heap_max_in_bytes" : 8520204288,
            "non_heap_used_in_bytes" : 102879720,
            "non_heap_committed_in_bytes" : 108969984,
            "pools" : {
              "young" : {
                "used_in_bytes" : 451090984,
                "max_in_bytes" : 558432256,
                "peak_used_in_bytes" : 558432256,
                "peak_max_in_bytes" : 558432256
              },
              "survivor" : {
                "used_in_bytes" : 16900480,
                "max_in_bytes" : 69730304,
                "peak_used_in_bytes" : 69730304,
                "peak_max_in_bytes" : 69730304
              },
              "old" : {
                "used_in_bytes" : 2496492112,
                "max_in_bytes" : 7892041728,
                "peak_used_in_bytes" : 5608349440,
                "peak_max_in_bytes" : 7892041728
              }
            }
          },
          "threads" : {
            "count" : 88,
            "peak_count" : 99
          },
          "gc" : {
            "collectors" : {
              "young" : {
                "collection_count" : 86,
                "collection_time_in_millis" : 14592
              },
              "old" : {
                "collection_count" : 11,
                "collection_time_in_millis" : 780
              }
            }
          },
          "buffer_pools" : {
            "direct" : {
              "count" : 57,
              "used_in_bytes" : 795161001,
              "total_capacity_in_bytes" : 795161000
            },
            "mapped" : {
              "count" : 1080,
              "used_in_bytes" : 12730171248,
              "total_capacity_in_bytes" : 12730171248
            }
          },
          "classes" : {
            "current_loaded_count" : 11264,
            "total_loaded_count" : 11388,
            "total_unloaded_count" : 124
          }
        }
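
One thing that might help narrow this down is enabling GC logging so the long pauses show up with full detail; these are standard HotSpot flags for Java 8 added to jvm.options (the log path is an assumption):

    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintTenuringDistribution
    -XX:+PrintGCApplicationStoppedTime
    # log path assumed
    -Xloggc:/var/log/elasticsearch/gc.log
    -XX:+UseGCLogFileRotation
    -XX:NumberOfGCLogFiles=32
    -XX:GCLogFileSize=64m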

Did you ever find a resolution? I think I have something similar after upgrading from 5.5.1 to 5.6.5 then to 6.1.1. It seems to be triggered by a pretty large aggregation, but it ran without killing the nodes on 5.x.
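
One stopgap that may be worth trying for the large-aggregation case is tightening the request circuit breaker so an oversized aggregation is rejected instead of exhausting the heap. It is a dynamic cluster setting; the 40% value below is only an example to experiment with (the default is 60% of heap):

    curl -XPUT 'http://localhost:9200/_cluster/settings?pretty' -H 'Content-Type: application/json' -d'
    {
      "transient": {
        "indices.breaker.request.limit": "40%"
      }
    }'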
