Change GC log thresholds

Greetings
I am seeing a lot of lines in my Elasticsearch logs which look like this:

[2018-11-02T08:00:12,056][INFO ][o.e.m.j.JvmGcMonitorService] [vie01a-clog-pesd-pe05] [gc][24] overhead, spent [277ms] collecting in the last [1s]
[2018-11-02T08:00:25,212][INFO ][o.e.m.j.JvmGcMonitorService] [vie01a-clog-pesd-pe05] [gc][37] overhead, spent [355ms] collecting in the last [1.1s]
[2018-11-02T08:01:12,534][INFO ][o.e.m.j.JvmGcMonitorService] [vie01a-clog-pesd-pe05] [gc][84] overhead, spent [374ms] collecting in the last [1.2s]
[2018-11-02T08:01:42,570][INFO ][o.e.m.j.JvmGcMonitorService] [vie01a-clog-pesd-pe05] [gc][114] overhead, spent [432ms] collecting in the last [1s]
[2018-11-02T08:01:58,697][INFO ][o.e.m.j.JvmGcMonitorService] [vie01a-clog-pesd-pe05] [gc][130] overhead, spent [304ms] collecting in the last [1.1s]
[2018-11-02T08:02:15,709][INFO ][o.e.m.j.JvmGcMonitorService] [vie01a-clog-pesd-pe05] [gc][147] overhead, spent [337ms] collecting in the last [1s]
[2018-11-02T08:02:28,096][INFO ][o.e.m.j.JvmGcMonitorService] [vie01a-clog-pesd-pe05] [gc][159] overhead, spent [431ms] collecting in the last [1.3s]
[2018-11-02T08:03:21,184][INFO ][o.e.m.j.JvmGcMonitorService] [vie01a-clog-pesd-pe05] [gc][212] overhead, spent [426ms] collecting in the last [1s]
[2018-11-02T08:14:45,859][INFO ][o.e.m.j.JvmGcMonitorService] [vie01a-clog-pesd-pe05] [gc][895] overhead, spent [590ms] collecting in the last [1.6s]
[2018-11-02T08:17:35,096][WARN ][o.e.m.j.JvmGcMonitorService] [vie01a-clog-pesd-pe05] [gc][1064] overhead, spent [643ms] collecting in the last [1s]

My cluster is ingesting about 5,000 events per second. It has 6 nodes, each with 96 GB of RAM, of which 30 GB is dedicated to the heap (-Xms30g and -Xmx30g).

I see no adverse impact on performance, and I would like to raise the logging threshold from INFO to WARN so that the garbage collector doesn't spam my log files.

I think the command will look something like

PUT /_cluster/settings
{
  "persistent": {
    "logger.org.elasticsearch.something.something.JvmGcMonitorService": "WARN"
  }
}

However, I do not know the exact logger path, since in the logs it is abbreviated to o.e.m.j.JvmGcMonitorService, and I can't find the relevant documentation.

What logger path should I use? Alternatively, how should I address this issue?

Found it, it was

logger.org.elasticsearch.monitor.jvm.JvmGcMonitorService
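
So, combining that with the request format from my first post, the full command would presumably be this (I haven't applied it yet):

PUT /_cluster/settings
{
  "persistent": {
    "logger.org.elasticsearch.monitor.jvm.JvmGcMonitorService": "WARN"
  }
}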

Now, where can I adjust the WARN thresholds? I.e., how do I set it so that a WARN is issued only when GC takes more than, say, 750ms in the last 1s?

Looking at the source of org.elasticsearch.monitor.jvm.JvmGcMonitorService, I think this setting will warn you when GC takes more than 75% of the time (i.e. 750ms out of the last second). I haven't tried it; I've only read the code, so it would be good to know whether it works:

monitor.jvm.gc.overhead.warn: 75
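
For reference, the same code also seems to define matching info and debug thresholds (monitor.jvm.gc.overhead.info and monitor.jvm.gc.overhead.debug, with a check that debug <= info <= warn), so if you wanted to raise all three in elasticsearch.yml it might look like the following. Again, this is an untested sketch read from the source only:

monitor.jvm.gc.overhead.debug: 25
monitor.jvm.gc.overhead.info: 50
monitor.jvm.gc.overhead.warn: 75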

Thanks for the suggestion. I tried the following two approaches, neither of which worked:

PUT /_cluster/settings
{
  "persistent": {
    "logger.org.elasticsearch.monitor.jvm.gc.overhead.warn": "90"
  }
}

Response:

{
  "error": {
    "root_cause": [
      {
        "type": "remote_transport_exception",
        "reason": "[vie01a-clog-pesm-pe01][xxx.xxx.xxx.xxx:9300][cluster:admin/settings/update]"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Unknown level constant [90]."
  },
  "status": 400
}

If I enter "WARN" as the value, it is accepted, but that only sets a logger level rather than the GC threshold, so it isn't useful here.

The second way I tried was this:

PUT /_cluster/settings
{
  "persistent": {
    "monitor.jvm.gc.overhead.warn": 90
  }
}

Response:

{
  "error": {
    "root_cause": [
      {
        "type": "remote_transport_exception",
        "reason": "[vie01a-clog-pesm-pe01][xxx.xxx.xxx.xxx:9300][cluster:admin/settings/update]"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "persistent setting [monitor.jvm.gc.overhead.warn], not dynamically updateable"
  },
  "status": 400
}

This suggests that I need to put this setting into a config file instead.

So what I did was add the following line to elasticsearch.yml:

monitor.jvm.gc.overhead.warn: 90

The setting is accepted at startup; I will report back once it has been logging for a while.
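
In case it helps anyone else checking the same thing: I believe the node info API can confirm the value was actually picked up after the restart, e.g.

GET /_nodes/settings

and then look for monitor.jvm.gc.overhead.warn in each node's settings.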


Apologies for not clarifying how to set that setting, and thanks for describing your sleuthing towards the right answer! Yes, it's not a dynamic setting, so adding the line I quoted to elasticsearch.yml is the right thing to do.

After a bit of logging, I can confirm that setting the threshold this way works. Here is a current sample from one of my logs:

[2018-11-08T07:15:20,652][WARN ][o.e.m.j.JvmGcMonitorService] [vie01a-clog-pesd-pe06] [gc][young][231336][53246] duration [1.1s], collections [1]/[1.2s], total [1.1s]/[1.3h], memory [11.1gb]->[10.4gb]/[29.8gb], all_pools {[young] [893.3mb]->[1.5mb]/[931.8mb]}{[survivor] [116.4mb]->[116.4mb]/[116.4mb]}{[old] [10.1gb]->[10.3gb]/[28.8gb]}
[2018-11-08T07:15:20,652][WARN ][o.e.m.j.JvmGcMonitorService] [vie01a-clog-pesd-pe06] [gc][231336] overhead, spent [1.1s] collecting in the last [1.2s]
[2018-11-08T08:15:12,334][WARN ][o.e.m.j.JvmGcMonitorService] [vie01a-clog-pesd-pe06] [gc][young][234915][53987] duration [1s], collections [1]/[1.6s], total [1s]/[1.3h], memory [14.8gb]->[14.4gb]/[29.8gb], all_pools {[young] [726.2mb]->[45.6mb]/[931.8mb]}{[survivor] [116.4mb]->[116.4mb]/[116.4mb]}{[old] [14gb]->[14.2gb]/[28.8gb]}
[2018-11-08T09:15:13,184][WARN ][o.e.m.j.JvmGcMonitorService] [vie01a-clog-pesd-pe06] [gc][young][238499][54854] duration [1s], collections [1]/[1.5s], total [1s]/[1.4h], memory [7.1gb]->[6.7gb]/[29.8gb], all_pools {[young] [713.3mb]->[316.7kb]/[931.8mb]}{[survivor] [116.4mb]->[116.4mb]/[116.4mb]}{[old] [6.3gb]->[6.6gb]/[28.8gb]}

Now, when an actual issue happens, I won't have to grep -v "JvmGcMonitorService" just to see the relevant messages amid the GC clutter.
