High CPU (cgroup) usage/utilization

Hi.

We have only one node in our cluster (hosted on Elastic Cloud, v7.10.0) with Kibana (for filebeat and metricbeat), and everything was working fine until about a week ago. I'm pretty new to all this, but I've set up alerts for high CPU usage, and they have been tripping since last week and I have no idea why. The load on our instance shouldn't have changed: the amount of data coming in is fairly constant and hasn't changed in at least a month. I've also checked hot threads and long-running tasks, but there's nothing unusual there either. The only thing out of the ordinary is that Cgroup CPU utilization is above the 85% threshold some of the time, and even those patterns are odd.

Here are some graphs for the last 24 hours:


As can be seen in the graphs above, Cgroup CPU utilization jumped up yesterday at around 21:30 and has stayed up since. However, the non-cgroup CPU utilization seems fine (it doesn't exceed 6%).

And a close-up graph for the last 2 hours:

I don't see Kibana being the reason for this, as can be seen here (last 24 hours):

Kibana for the last 2 hours:

Here are also some other performance graphs (for the last 24 hours):

And our instances' health status:

I've already restarted the cluster, to no avail. I've even stopped shipping logs (to the same instance, since we only have one), which seemed to help for a day or so, but now the high CPU usage is back.

Can anyone help me debug what's causing this? If there's a "rogue" client sending too much data or anything like that, how can I find out which client is doing it?

OK, so we have about 5 "clients" (filebeat and metricbeat) that are sending data to Elastic Cloud. I've turned off all of those clients, so no new data has come into the cloud for the last 4 days. And for whatever reason, I've lost all data except for the last 2 days, even though there was no activity on the node for the last 4 days (it wasn't restarted, etc.):

Not only that, something is still causing 50% CPU usage for whatever reason:

... even though really not much is happening on the instance:

... and the instance config hasn't changed for the last 13 days:

Any kind of help or insight would be much appreciated.

I have exactly the same issues... did you find anything?

I'm still debugging the issue, but what I've done so far is disable all our metricbeat and filebeat clients, as well as node logs and metrics. With all that disabled, the CPU utilization dropped to around 10%, which is not 0%, but some utilization is expected from a running node. I was still getting around 100 index and search requests per 5 minutes, and I have no idea where they are coming from (internal stuff?).

If I enabled just node metrics, CPU utilization jumped to around 20-25% (on a cluster with just 1 node, 4 GB RAM, and 2 GB RAM for Kibana), which seems very high for just collecting some metrics.
If I enabled just node logs, CPU utilization seemed unaffected (it stayed at around 10%).

So I decided to enable node logs and 4 of the 6 clients (the remaining clients, which weren't enabled yet, produce the most logs but the same amount of metrics as the others), and the CPU utilization seemed unaffected (it stayed at around 10%).

Then I decided to enable the last 2 clients, so all the clients were back online but with no node metrics (Stack Monitoring). CPU utilization at first (in the morning, when I enabled the last clients) seemed unaffected, but at around 18:00 it gradually increased to around 20-25%. Search requests stayed at around 100 per 5 minutes, but index requests increased to around 400 per 5 minutes.

These are the current performance graphs (running everything except Stack Monitoring):

Before the upgrade to Elastic 7.10.0 we were paying $36/month (for the same 6-client setup) for one of the cheapest deployments (2 GB RAM for Elastic and 1 GB for Kibana). That setup had constant memory pressure at around 75%, sometimes even higher, and seemed to need more resources to ensure stable operation. That's why I doubled the resources on both the Elastic and Kibana side (4 GB RAM for Elastic, 2 GB for Kibana), which now costs us 3x as much at $110/month, a significant cost increase for such a small deployment.
The deployment was running fine after that (on 7.6.2), but I needed to enable stack monitoring alerts, and an upgrade to 7.10 seemed like a good solution. At first everything was fine after the upgrade, but after a few days or even weeks (I don't remember exactly) I started receiving high CPU utilization alerts without any change on the Elastic/Kibana side or on our clients' side. And after contacting Elastic support, they didn't know how to solve our problem other than by allocating even more resources to our cluster, which could easily increase our cost by another 2x to about $220/month, just to "fix" the problem?

The next step now is to enable additional logging on our clients, which will put more pressure on the node/cluster, and see how CPU Usage and Memory Pressure behave after that, before I try to (re)enable Stack Monitoring to see what happens this time.
I'm also tweaking Index Templates and ILM policies in the meantime to see if that has any impact on node performance.

Oh, and regarding my previous post about having only 3 days of metrics: it turns out that something was keeping only the last 3 daily metrics indices, but I have no idea what was deleting the older ones, since these indices weren't linked to any ILM policies.
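
For reference, this is roughly how I checked whether those daily indices were managed by ILM at all (the index pattern below is just an example; adjust it to whatever your daily metrics indices are called):

# Shows, per index, whether ILM manages it and which policy/phase it is in
GET /.monitoring-*/_ilm/explain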

You seem to be saying that the CPU usage number is higher than some threshold, but I don't understand how that is manifesting as an actual problem. Can you explain why this matters? Why did you set up alerts on high CPU usage in the first place, and why did you choose the thresholds you chose for those alerts?

The problem with high CPU (extended periods over 100%) is that the node's response times increase significantly. The CPU load increased so much during that time that we used up all the CPU credits and were frequently running at 90-100% CPU, which even made Kibana slow to respond/browse.

The reason for setting up monitoring alerts was simply to be alerted if anything out of the ordinary happens with the node/cluster (which I was soon notified of). I didn't set up just the high CPU usage alerts but others as well (high disk usage, etc.). Also, I didn't choose any specific thresholds for the alerts, just what seemed like sensible general ones to me (over 90% CPU and/or disk usage, missing monitoring data...).

Would you be able to share a diagnostics dump or at least tasks and hot threads from a period where the cluster is responding slowly?
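
For reference, capturing those is usually just a couple of calls from Kibana Dev Tools (or curl); the parameters here are only suggestions:

# A few samples of this, taken while the node is responding slowly, are the most useful
GET /_nodes/hot_threads?threads=9999

# Currently running tasks, with descriptions and running times
GET /_tasks?detailed&human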

Also, approximately how much data are you ingesting into the cluster? (X GB/hour or similar).

It seems like you are aware of CPU credits. I wonder if, in the past, the workload was different such that you did not use up all the credits?

Unfortunately I don't have any such data anymore. I'd have to reproduce the case (probably with the same setup, just enabling Stack Monitoring), but I'm reluctant to do that since I'm not sure what the outcome will be and how that would impact the stability of the node.
However, I did check hot threads and long-running tasks back then, and there were no obvious problems/culprits there.

I don't know how to measure the amount of data we're ingesting into the cluster, since I know of no such API call (can you help me with that?), and I also don't want to turn on instance monitoring because I'm afraid I'll overload the instance again. I can only give you the general performance graphs and some log screenshots.
The current setup is that all of our 6 clients are sending logs to our node (4 of them are now sending additional, constant logs, whereas before they barely sent any), and the CPU utilization is mostly around 30%, but it keeps spiking over 100% practically every hour (and I don't know what's causing it or how to find out).

Performance for the last 24 hours:

Logs metrics for the last 7 days:

I'm not sure what happened to the CPU workload, but before the upgrade it wasn't so high that we'd need CPU credits for it. Then, some time after the upgrade to v7.10.0, CPU usage got so high that we used up the CPU credits in a few hours and weren't able to earn them back, since our CPU usage was constantly too high.

One significant change between 7.6 and 7.10 is the use of the G1 garbage collector. We have made some not-yet-released improvements, but those are mostly in the 10% range. It could be interesting to see if we can correlate the spikes with GC. The ES log should indicate if a lot of time is spent on GC.

The amount of data indexed can be found in a few ways; one way is to call _stats twice, one minute apart, and then diff the output.
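
For example, something along these lines (the filter_path is just to trim the output down to the interesting counters):

# Take this once, wait about a minute, then take it again and subtract:
# the difference in index_total is documents indexed per minute, and the
# difference in size_in_bytes is roughly how much the stored data grew.
GET /_stats/indexing,store?filter_path=_all.primaries.indexing.index_total,_all.primaries.store.size_in_bytes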

Also, seeing tasks and hot threads from your current usage might reveal something; I would be willing to take a look even if it is not from the overload situation.

I've captured the _stats output as suggested (about a minute apart). I've also included the hot_threads and _tasks output in the logs at this link: https://drive.google.com/file/d/1UDrOOy6Bj1IPnserkJNGUiY-A9whHkfZ

As for the GC spikes: I can't quite check that, because some of the Performance graphs stopped working after a few days and I have no idea why. I've enabled the ML and advanced Search instances, which added at least one more instance to the Performance graphs.


Hm, the graphs are back now (they have been for a week), even though I didn't change any configs/loads on the server or client side.

I did some more research and may have found the reason for the CPU spikes. It seems one of our clients is sending a lot of logs in a short time. I've taken action to reduce the excessive logging and will see if that helps. If it does, I might also turn Stack Monitoring back on to see how that affects the CPU utilization.

Thanks for that extra info. I did have a look at the files you sent earlier, but they did not show anything interesting since the cluster was doing nearly nothing at the time.

Looking forward to hearing whether this behaves well without the excessive logging.

Reducing logging from one of our clients improved the situation, but I still saw CPU spikes occurring about every hour (20 minutes past the hour) that didn't really correlate with our clients' activity. The only thing that could be related to that (that I could find) is the "cron schedule for retaining snapshots" (under Snapshot and Restore -> Policy), which is set to 0 20,50 * * * ? and can't be changed (I'll open a new topic for that).
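
For reference, that schedule lives in the cluster settings and would normally be changed with something like this (the schedule value below is just an example), but in my case the change doesn't stick:

PUT /_cluster/settings
{
  "persistent": {
    "slm.retention_schedule": "0 30 1 * * ?"
  }
}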

However, in the last few days we haven't had any such CPU spikes, or the ones that do happen seem to correlate with our clients' activity.

I had another weird bug with the performance graphs, where just the "CPU Usage" and "Number of Requests" graphs were lagging behind by about 6 hours, while the rest were fine.

This doesn't make sense to me: how can Stack Monitoring take up so much CPU? These are the new performance graphs after I turned Stack Monitoring back on. Can you spot the moment that happened?

Before enabling monitoring, CPU usage was around 30%; after enabling it, it jumped to around 150% :astonished:

This is the GET /_tasks?human output:

{
  "nodes" : {
    "xOYwLFnDQvOre0gOnWQNPg" : {
      "name" : "instance-0000000005",
      "transport_address" : "10.46.24.70:19941",
      "host" : "10.46.24.70",
      "ip" : "10.46.24.70:19941",
      "roles" : [
        "data",
        "data_cold",
        "data_content",
        "data_hot",
        "data_warm",
        "ingest",
        "master",
        "remote_cluster_client",
        "transform"
      ],
      "attributes" : {
        "logical_availability_zone" : "zone-0",
        "server_name" : "instance-0000000005.6a43cf2206664fce853a683f47dfb878",
        "availability_zone" : "westeurope-3",
        "xpack.installed" : "true",
        "instance_configuration" : "azure.data.highio.l32sv2",
        "transform.node" : "true",
        "region" : "unknown-region"
      },
      "tasks" : {
        "xOYwLFnDQvOre0gOnWQNPg:122977" : {
          "node" : "xOYwLFnDQvOre0gOnWQNPg",
          "id" : 122977,
          "type" : "direct",
          "action" : "cluster:monitor/tasks/lists[n]",
          "start_time" : "2021-02-19T10:40:31.710Z",
          "start_time_in_millis" : 1613731231710,
          "running_time" : "298.7micros",
          "running_time_in_nanos" : 298707,
          "cancellable" : false,
          "parent_task_id" : "xOYwLFnDQvOre0gOnWQNPg:122976",
          "headers" : { }
        },
        "xOYwLFnDQvOre0gOnWQNPg:122976" : {
          "node" : "xOYwLFnDQvOre0gOnWQNPg",
          "id" : 122976,
          "type" : "transport",
          "action" : "cluster:monitor/tasks/lists",
          "start_time" : "2021-02-19T10:40:31.710Z",
          "start_time_in_millis" : 1613731231710,
          "running_time" : "613.6micros",
          "running_time_in_nanos" : 613616,
          "cancellable" : false,
          "headers" : { }
        }
      }
    },
    "_yuKoYXYQp-heg7ZjoOi7w" : {
      "name" : "instance-0000000006",
      "transport_address" : "10.46.24.89:19164",
      "host" : "10.46.24.89",
      "ip" : "10.46.24.89:19164",
      "roles" : [
        "ml",
        "remote_cluster_client"
      ],
      "attributes" : {
        "logical_availability_zone" : "zone-0",
        "server_name" : "instance-0000000006.6a43cf2206664fce853a683f47dfb878",
        "availability_zone" : "westeurope-3",
        "ml.machine_memory" : "1073741824",
        "xpack.installed" : "true",
        "instance_configuration" : "azure.ml.d64sv3",
        "transform.node" : "false",
        "ml.max_open_jobs" : "20",
        "region" : "unknown-region"
      },
      "tasks" : {
        "_yuKoYXYQp-heg7ZjoOi7w:71897" : {
          "node" : "_yuKoYXYQp-heg7ZjoOi7w",
          "id" : 71897,
          "type" : "persistent",
          "action" : "xpack/ml/datafeed[c]",
          "start_time" : "2021-02-19T10:18:21.806Z",
          "start_time_in_millis" : 1613729901806,
          "running_time" : "22.1m",
          "running_time_in_nanos" : 1329905208451,
          "cancellable" : true,
          "parent_task_id" : "cluster:49",
          "headers" : { }
        },
        "_yuKoYXYQp-heg7ZjoOi7w:71992" : {
          "node" : "_yuKoYXYQp-heg7ZjoOi7w",
          "id" : 71992,
          "type" : "persistent",
          "action" : "xpack/ml/datafeed[c]",
          "start_time" : "2021-02-19T10:18:23.326Z",
          "start_time_in_millis" : 1613729903326,
          "running_time" : "22.1m",
          "running_time_in_nanos" : 1328385730898,
          "cancellable" : true,
          "parent_task_id" : "cluster:50",
          "headers" : { }
        },
        "_yuKoYXYQp-heg7ZjoOi7w:82202" : {
          "node" : "_yuKoYXYQp-heg7ZjoOi7w",
          "id" : 82202,
          "type" : "transport",
          "action" : "cluster:monitor/tasks/lists[n]",
          "start_time" : "2021-02-19T10:40:31.711Z",
          "start_time_in_millis" : 1613731231711,
          "running_time" : "105.4micros",
          "running_time_in_nanos" : 105400,
          "cancellable" : false,
          "parent_task_id" : "xOYwLFnDQvOre0gOnWQNPg:122976",
          "headers" : { }
        },
        "_yuKoYXYQp-heg7ZjoOi7w:71807" : {
          "node" : "_yuKoYXYQp-heg7ZjoOi7w",
          "id" : 71807,
          "type" : "persistent",
          "action" : "xpack/ml/job[c]",
          "start_time" : "2021-02-19T10:18:19.933Z",
          "start_time_in_millis" : 1613729899933,
          "running_time" : "22.1m",
          "running_time_in_nanos" : 1331778336769,
          "cancellable" : true,
          "parent_task_id" : "cluster:48",
          "headers" : { }
        },
        "_yuKoYXYQp-heg7ZjoOi7w:71806" : {
          "node" : "_yuKoYXYQp-heg7ZjoOi7w",
          "id" : 71806,
          "type" : "persistent",
          "action" : "xpack/ml/job[c]",
          "start_time" : "2021-02-19T10:18:19.933Z",
          "start_time_in_millis" : 1613729899933,
          "running_time" : "22.1m",
          "running_time_in_nanos" : 1331778384769,
          "cancellable" : true,
          "parent_task_id" : "cluster:47",
          "headers" : { }
        }
      }
    }
  }
}

And GET /_nodes/hot_threads is hit-and-miss, with nothing useful IMO:

::: {instance-0000000006}{_yuKoYXYQp-heg7ZjoOi7w}{edUv8Mt4TqaQNBpS-X8gEQ}{10.46.24.89}{10.46.24.89:19164}{lr}{logical_availability_zone=zone-0, server_name=instance-0000000006.6a43cf2206664fce853a683f47dfb878, availability_zone=westeurope-3, ml.machine_memory=1073741824, xpack.installed=true, instance_configuration=azure.ml.d64sv3, transform.node=false, ml.max_open_jobs=20, region=unknown-region}
   Hot threads at 2021-02-19T10:43:21.522Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: {instance-0000000005}{xOYwLFnDQvOre0gOnWQNPg}{axeTq0lATBaEHajePQZbAQ}{10.46.24.70}{10.46.24.70:19941}{cdhimrstw}{logical_availability_zone=zone-0, server_name=instance-0000000005.6a43cf2206664fce853a683f47dfb878, availability_zone=westeurope-3, xpack.installed=true, instance_configuration=azure.data.highio.l32sv2, transform.node=true, region=unknown-region}
   Hot threads at 2021-02-19T10:43:21.521Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   20.6% (103.2ms out of 500ms) cpu usage by thread 'elasticsearch[instance-0000000005][management][T#5]'
     3/10 snapshots sharing following 18 elements
       app//org.elasticsearch.action.admin.cluster.stats.MappingVisitor.visitMapping(MappingVisitor.java:40)
       app//org.elasticsearch.action.admin.cluster.stats.MappingVisitor.visitMapping(MappingVisitor.java:40)
       app//org.elasticsearch.action.admin.cluster.stats.MappingVisitor.visitMapping(MappingVisitor.java:40)
       app//org.elasticsearch.action.admin.cluster.stats.MappingStats.of(MappingStats.java:58)
       app//org.elasticsearch.action.admin.cluster.stats.ClusterStatsResponse.<init>(ClusterStatsResponse.java:78)
       app//org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.newResponse(TransportClusterStatsAction.java:83)
       app//org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.newResponse(TransportClusterStatsAction.java:54)
       app//org.elasticsearch.action.support.nodes.TransportNodesAction.newResponse(TransportNodesAction.java:137)
       app//org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.lambda$finishHim$1(TransportNodesAction.java:263)
       app//org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$$Lambda$5146/0x000000080189cba8.get(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:58)
       app//org.elasticsearch.action.ActionRunnable$$Lambda$5147/0x000000080189cdd0.accept(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.base@15.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@15.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@15.0.1/java.lang.Thread.run(Thread.java:832)
     7/10 snapshots sharing following 15 elements
       app//org.elasticsearch.action.admin.cluster.stats.MappingStats.of(MappingStats.java:58)
       app//org.elasticsearch.action.admin.cluster.stats.ClusterStatsResponse.<init>(ClusterStatsResponse.java:78)
       app//org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.newResponse(TransportClusterStatsAction.java:83)
       app//org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.newResponse(TransportClusterStatsAction.java:54)
       app//org.elasticsearch.action.support.nodes.TransportNodesAction.newResponse(TransportNodesAction.java:137)
       app//org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.lambda$finishHim$1(TransportNodesAction.java:263)
       app//org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$$Lambda$5146/0x000000080189cba8.get(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:58)
       app//org.elasticsearch.action.ActionRunnable$$Lambda$5147/0x000000080189cdd0.accept(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.base@15.0.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
       java.base@15.0.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
       java.base@15.0.1/java.lang.Thread.run(Thread.java:832)

This is the GET /_cluster/settings output:

{
  "persistent" : {
    "action" : {
      "auto_create_index" : ".ent-search-*-logs-*,-.ent-search-*,+*"
    },
    "cluster" : {
      "routing" : {
        "allocation" : {
          "disk" : {
            "threshold_enabled" : "true"
          }
        }
      },
      "indices" : {
        "close" : {
          "enable" : "true"
        }
      },
      "metadata" : {
        "display_name" : "koco-robot"
      }
    },
    "slm" : {
      "retention_schedule" : "0 20,50 * * * ?"
    },
    "snapshot" : {
      "max_concurrent_operations" : "1"
    },
    "xpack" : {
      "monitoring" : {
        "collection" : {
          "enabled" : "true",
          "interval" : "60s"
        },
        "exporters" : {
          "__no-default-local__" : {
            "type" : "local",
            "enabled" : "false"
          },
          "found-user-defined" : {
            "type" : "local",
            "enabled" : "true"
          }
        }
      },
      "ml" : {
        "max_model_memory_limit" : "315MB",
        "max_machine_memory_percent" : "41"
      }
    }
  },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "enable" : "all"
        }
      }
    }
  }
}

The hot threads you shared suggest it's spending time computing mapping stats, which could be because you have a lot of indices or your mappings are large. You might see some benefit from an upgrade to 7.11.1 if so:
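
If you want to gauge that, something like this gives a rough idea of how many indices and mapped fields the cluster is carrying (the filter_path just trims the response):

# Index/shard counts and mapping field-type statistics for the whole cluster
GET /_cluster/stats?filter_path=indices.count,indices.shards.total,indices.mappings

# Or simply list the indices
GET /_cat/indices?v&s=index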

Also you're indexing monitoring docs back into the local cluster, which adds load too.

Still, given that it's only supposed to collect monitoring data every 60s I don't think that's a very satisfactory explanation. Given that the GC overhead stays at zero, I would expect hot threads to contain the answer if the load was coming from Elasticsearch. Assuming you've run it a lot without seeing anything else of interest, I wonder if it's something else in the environment that gets switched on when monitoring is enabled.

In an earlier post you shared a dashboard containing some node-specific metrics including one called "CPU Utilisation". Does that also jump up at the same time?

And some server logs if it helps:

11:48:14.000 | kibana.log |  | [kibana][access] 10.46.18.5  "post /api/index_management/indices/reload HTTP/" 200 9
11:48:15.000 | kibana.log |  | [kibana][access] 10.46.18.5  "get /api/licensing/info HTTP/" 200 9
11:48:15.354 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(959) Concurrent Mark From Roots 1390.183ms
11:48:15.354 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(959) Concurrent Preclean
11:48:15.355 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(959) Concurrent Preclean 0.884ms
11:48:15.355 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(959) Concurrent Mark (1853.850s, 1855.241s) 1391.170ms
11:48:15.355 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(959) Pause Remark
11:48:15.402 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(959) Pause Remark 976M->976M(1868M) 47.186ms
11:48:15.403 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(959) User=0.09s Sys=0.00s Real=0.04s
11:48:15.403 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] Safepoint "G1Concurrent", Time since last: 1395135259 ns, Reaching safepoint: 403410 ns, At safepoint: 47455803 ns, Total: 47859213 ns
11:48:15.403 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(959) Concurrent Rebuild Remembered Sets
11:48:16.187 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(959) Concurrent Rebuild Remembered Sets 784.304ms
11:48:16.187 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(959) Pause Cleanup
11:48:16.188 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(959) Pause Cleanup 1140M->1140M(1868M) 0.824ms
11:48:16.188 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(959) User=0.00s Sys=0.00s Real=0.00s
11:48:16.188 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] Safepoint "G1Concurrent", Time since last: 784454981 ns, Reaching safepoint: 356209 ns, At safepoint: 983725 ns, Total: 1339934 ns
11:48:16.188 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(959) Concurrent Cleanup for Next Mark
11:48:16.192 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(959) Concurrent Cleanup for Next Mark 3.103ms
11:48:16.192 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(959) Concurrent Cycle 2231.997ms
11:48:17.000 | kibana.log |  | [kibana][access] 172.17.0.61  "get /api/status HTTP/" 200 9
11:48:17.000 | kibana.log |  | [kibana][access] 172.17.0.61  "get /api/stats?extended=true&legacy=true&exclude_usage=true HTTP/" 200 9
11:48:17.000 | kibana.log |  | [kibana][access] 172.17.0.61  "get /api/settings?extended=true&legacy=true HTTP/" 200 9
11:48:17.622 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(960) Pause Young (Concurrent Start) (G1 Humongous Allocation)
11:48:17.622 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(960) Using 2 workers of 2 for evacuation
11:48:17.622 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(960) Desired survivor size 41418752 bytes, new threshold 15 (max threshold 15)
11:48:17.632 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(960) Age table with threshold 15 (max threshold 15)
11:48:17.632 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(960) - age   1:    8591512 bytes,    8591512 total
11:48:17.632 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(960) - age   2:      34808 bytes,    8626320 total
11:48:17.632 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(960) - age   3:     525536 bytes,    9151856 total
11:48:17.632 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(960) - age   4:     127976 bytes,    9279832 total
11:48:17.632 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(960) - age   5:      25808 bytes,    9305640 total
11:48:17.632 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(960) - age   6:       3720 bytes,    9309360 total
11:48:17.632 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(960) - age   7:     117296 bytes,    9426656 total
11:48:17.632 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(960)   Pre Evacuate Collection Set: 0.4ms
11:48:17.632 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(960)   Merge Heap Roots: 0.3ms
11:48:17.632 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(960)   Evacuate Collection Set: 7.3ms
11:48:17.632 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(960)   Post Evacuate Collection Set: 1.5ms
11:48:17.632 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(960)   Other: 0.5ms
11:48:17.632 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(960) Eden regions: 463->0(624)
11:48:17.632 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(960) Survivor regions: 23->10(79)
11:48:17.632 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(960) Old regions: 704->704
11:48:17.632 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(960) Archive regions: 2->2
11:48:17.632 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(960) Humongous regions: 45->15
11:48:17.632 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(960) Metaspace: 141285K(146256K)->141285K(146256K) NonClass: 124474K(127876K)->124474K(127876K) Class: 16811K(18380K)->16811K(18380K)
11:48:17.632 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(960) Pause Young (Concurrent Start) (G1 Humongous Allocation) 1234M->728M(1868M) 10.036ms
11:48:17.632 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(960) User=0.02s Sys=0.00s Real=0.01s
11:48:17.632 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(961) Concurrent Cycle
11:48:17.632 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(961) Concurrent Clear Claimed Marks
11:48:17.632 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] Safepoint "G1TryInitiateConcMark", Time since last: 1433143421 ns, Reaching safepoint: 331008 ns, At safepoint: 10246460 ns, Total: 10577468 ns
11:48:17.632 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(961) Concurrent Clear Claimed Marks 0.217ms
11:48:17.632 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(961) Concurrent Scan Root Regions
11:48:17.636 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(961) Concurrent Scan Root Regions 3.239ms
11:48:17.636 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(961) Concurrent Mark (1857.522s)
11:48:17.636 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(961) Concurrent Mark From Roots
11:48:17.636 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(961) Using 1 workers of 1 for marking
11:48:18.000 | kibana.log |  | [kibana][access] 10.46.24.47  "get / HTTP/" 302 9
11:48:18.838 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(961) Concurrent Mark From Roots 1202.576ms
11:48:18.838 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(961) Concurrent Preclean
11:48:18.839 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(961) Concurrent Preclean 0.926ms
11:48:18.839 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(961) Concurrent Mark (1857.522s, 1858.725s) 1203.577ms
11:48:18.840 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(961) Pause Remark
11:48:18.877 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(961) Pause Remark 839M->839M(1868M) 36.752ms
11:48:18.877 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(961) User=0.07s Sys=0.01s Real=0.04s
11:48:18.877 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] Safepoint "G1Concurrent", Time since last: 1207513202 ns, Reaching safepoint: 251506 ns, At safepoint: 36972637 ns, Total: 37224143 ns
11:48:18.877 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(961) Concurrent Rebuild Remembered Sets
11:48:19.693 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(961) Concurrent Rebuild Remembered Sets 816.159ms
11:48:19.694 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(961) Pause Cleanup
11:48:19.695 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(961) Pause Cleanup 862M->862M(1868M) 1.159ms
11:48:19.695 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(961) User=0.00s Sys=0.00s Real=0.01s
11:48:19.695 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] Safepoint "G1Concurrent", Time since last: 816332488 ns, Reaching safepoint: 255206 ns, At safepoint: 1337734 ns, Total: 1592940 ns
11:48:19.695 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(961) Concurrent Cleanup for Next Mark
11:48:19.699 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(961) Concurrent Cleanup for Next Mark 4.165ms
11:48:19.699 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(961) Concurrent Cycle 2066.912ms
11:48:21.696 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] Safepoint "Cleanup", Time since last: 2000710002 ns, Reaching safepoint: 273907 ns, At safepoint: 17901 ns, Total: 291808 ns
11:48:22.523 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(962) Pause Young (Normal) (G1 Evacuation Pause)
11:48:22.523 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(962) Using 2 workers of 2 for evacuation
11:48:22.523 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(962) Desired survivor size 41943040 bytes, new threshold 15 (max threshold 15)
11:48:22.536 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(962) Age table with threshold 15 (max threshold 15)
11:48:22.536 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(962) - age   1:    7749360 bytes,    7749360 total
11:48:22.536 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(962) - age   2:    5753736 bytes,   13503096 total
11:48:22.536 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(962) - age   3:      14728 bytes,   13517824 total
11:48:22.536 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(962) - age   4:     493232 bytes,   14011056 total
11:48:22.536 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(962) - age   5:     126088 bytes,   14137144 total
11:48:22.536 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(962) - age   6:      21368 bytes,   14158512 total
11:48:22.536 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(962) - age   7:       3024 bytes,   14161536 total
11:48:22.536 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(962) - age   8:     117296 bytes,   14278832 total
11:48:22.536 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(962)   Pre Evacuate Collection Set: 0.2ms
11:48:22.536 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(962)   Merge Heap Roots: 0.4ms
11:48:22.536 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(962)   Evacuate Collection Set: 10.3ms
11:48:22.536 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(962)   Post Evacuate Collection Set: 1.5ms
11:48:22.536 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(962)   Other: 0.5ms
11:48:22.536 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(962) Eden regions: 624->0(623)
11:48:22.536 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(962) Survivor regions: 10->14(80)
11:48:22.536 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(962) Old regions: 704->704
11:48:22.536 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(962) Archive regions: 2->2
11:48:22.536 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(962) Humongous regions: 16->15
11:48:22.536 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(962) Metaspace: 141287K(146256K)->141287K(146256K) NonClass: 124476K(127876K)->124476K(127876K) Class: 16811K(18380K)->16811K(18380K)
11:48:22.536 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(962) Pause Young (Normal) (G1 Evacuation Pause) 1353M->733M(1868M) 12.923ms
11:48:22.536 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(962) User=0.02s Sys=0.00s Real=0.02s
11:48:22.536 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] Safepoint "G1CollectForAllocation", Time since last: 827046558 ns, Reaching safepoint: 260307 ns, At safepoint: 13134133 ns, Total: 13394440 ns
11:48:23.000 | kibana.log |  | [kibana][access] 10.46.24.47  "get / HTTP/" 302 9
11:48:23.703 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(963) Pause Young (Concurrent Start) (G1 Evacuation Pause)
11:48:23.703 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(963) Using 2 workers of 2 for evacuation
11:48:23.703 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(963) Desired survivor size 41943040 bytes, new threshold 15 (max threshold 15)
11:48:23.713 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(963) Age table with threshold 15 (max threshold 15)
11:48:23.714 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(963) - age   1:     861656 bytes,     861656 total
11:48:23.714 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(963) - age   2:    6805064 bytes,    7666720 total
11:48:23.714 | elasticsearch.gc | INFO | [elasticsearch.gc][INFO] GC(963) - age   3:    5749600 bytes,   13416320 total

This is how the logs preview bar (on the right) looked when Stack Monitoring was on (and then turned off).

The hot threads you shared suggest it's spending time computing mapping stats, which could be because you have a lot of indices or your mappings are large.

Can I turn this off or limit it somehow?

You might see some benefit from an upgrade to 7.11.1 if so:

Hm, thanks, I might upgrade to 7.11.1 to see if it helps. Will post it here when/if I do it.

Still, given that it's only supposed to collect monitoring data every 60s I don't think that's a very satisfactory explanation.

I've manually increased that interval from 10s to 60s to see if it would influence the CPU usage, but it didn't seem to.
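
(For reference, the interval is the xpack.monitoring.collection.interval cluster setting, so the change looks roughly like this:)

PUT /_cluster/settings
{
  "persistent": {
    "xpack.monitoring.collection.interval": "60s"
  }
}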

I would expect hot threads to contain the answer if the load was coming from Elasticsearch.

I can post more hot threads output once I try turning Stack Monitoring on again. But from what I've seen when running GET /_nodes/hot_threads, nothing obvious popped out. Sometimes it returned without any hot threads at all.
I would really like to know exactly what is causing this as well.

In an earlier post you shared a dashboard containing some node-specific metrics including one called "CPU Utilisation". Does that also jump up at the same time?

  • If you're referring to "CPU Utilisation" from the Performance graphs (the screenshot with 8 graphs in boxes), the answer is yes. It goes from ~30% to ~150% when monitoring is enabled.
  • If you're referring to "CPU Utilisation" from the Stack Monitoring graphs (I don't have current ones, because I disabled Stack Monitoring before I would have used up all the CPU credits), that "CPU Utilisation" remained very low (a few %), but the "Cgroup CPU utilization" was around 40% (in the first post it was more or less always at 100%).

It's part of the stats covered by monitoring, so yes you can turn it off by disabling monitoring.
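
Concretely, collection is controlled by the xpack.monitoring.collection.enabled cluster setting, so turning it off looks something like this:

PUT /_cluster/settings
{
  "persistent": {
    "xpack.monitoring.collection.enabled": false
  }
}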

By way of an experiment, would you try changing to a two-node deployment? I have a hunch there might be a problem that only affects the elected master, but since you only have one node we can't make the necessary comparison.

Sure, I'll try that and let you know (I just hope I'll have the time to try it soon).

Meanwhile, I've noticed new "strange" behaviour of the CPU Usage graphs:

Zoomed in on the time scale:

After disabling Stack Monitoring I get consistent CPU spikes that aren't related to the amount of data our clients send (I've checked that to make sure). The spikes occur at every 20th and 50th minute of the hour, and 10 minutes after those two times as well. So I get a CPU spike at the 20th, 30th, 50th and 60th (or 0th, at the start of the hour) minute of every hour. The only thing that could relate to this is the "cron schedule for retaining snapshots", which just can't be set to run on anything other than the default (already mentioned) cron schedule of 0 20,50 * * * ?
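
For what it's worth, the policy and its next scheduled runs can also be listed directly, in case anyone wants to cross-check the timing against the spikes:

# Shows each SLM policy, its schedule, and the next execution time
GET /_slm/policy?human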

A bit of an update. I've upgraded to v7.11.1 and had some strange CPU Usage behaviour after the upgrade. Prior to the upgrade, I disabled SLM (because of the previously mentioned CPU spikes) with POST /_slm/stop, but it seems the upgrade turned it back on, as can be seen here (pre- and post-upgrade are divided by the biggest CPU spike):
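
(In case it's useful to anyone, this is roughly how I check it and stop it again:)

# Whether the SLM background task is running or stopped
GET /_slm/status

# Stop SLM (scheduled snapshots and retention)
POST /_slm/stop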

After the upgrade I kept SLM on, and it showed up on the CPU Usage graph as spikes:

But after a while, without any change to any configuration, the spikes were gone, and now everything seems to be working as it should:


The spike at 2:30 is from the daily snapshot being taken (which is OK).

I still can't permanently change the cron schedule of 0 20,50 * * * ? because it always reverts to the default value within a few minutes.

Oh, and I still haven't re-enabled Stack Monitoring, since it took up way too many resources. I might try that as well at some later time.