JVM heap spikes on data node

radovan · April 21, 2021, 4:36pm

Hello,

I'm not really educated in this topic, but when I first saw this log in our logstash:

[2021-04-15T14:11:47,605][INFO ][logstash.outputs.elasticsearch][beats-monitoring-crypto][3228c43cca49b666651ceb4c7945f188f53b5e48f0efb9e5f9627ce96d04de35] retrying failed action with response code: 429 ({"type"=>"circuit_breaking_exception", "reason"=>"[parent] Data too large, data for [indices:data/write/bulk[s]] would be [7914649094/7.3gb], which is larger than the limit of [7820751667/7.2gb], real usage: [7914556632/7.3gb], new bytes reserved: [92462/90.2kb], usages [request=9184/8.9kb, fielddata=189099/184.6kb, in_flight_requests=24247630/23.1mb, model_inference=0/0b, accounting=46982278/44.8mb]", "bytes_wanted"=>7914649094, "bytes_limit"=>7820751667, "durability"=>"PERMANENT"})

I went through a few questions on Stackoverflow and here too, where it basically said to increase the heap size, so I did. It didn't took too much time and it appeared again just with different values:

retrying failed action with response code: 429 ({"type"=>"circuit_breaking_exception", "reason"=>"[parent] Data too large, data for [indices:data/write/bulk[s]] would be [10025661796/9.3gb], which is larger than the limit of [9775690547/9.1gb], real usage: [10025415888/9.3gb], new bytes reserved: [245908/240.1kb], usages [request=0/0b, fielddata=13538/13.2kb, in_flight_requests=245908/240.1kb, model_inference=0/0b, accounting=48939656/46.6mb]", "bytes_wanted"=>10025661796, "bytes_limit"=>9775690547, "durability"=>"PERMANENT"})

I went through docs and read something about breakers and queue sizes, that's actually how I later discovered, that the problem is regarding only about one data node, not the whole cluster. We have 3 hot nodes at the moment, each with same ubuntu version (18.04.5 LTS), same JDK bundled version (1.8.0_282) same jvm configuration:

-Xms10g
-Xmx10g
8-13:-XX:+UseConcMarkSweepGC
8-13:-XX:CMSInitiatingOccupancyFraction=75
8-13:-XX:+UseCMSInitiatingOccupancyOnly
-Des.networkaddress.cache.ttl=60
-Des.networkaddress.cache.negative.ttl=10
-XX:+AlwaysPreTouch
-Xss1m
-Djava.awt.headless=true
-Dfile.encoding=UTF-8
-Djna.nosys=true
-XX:-OmitStackTraceInFastThrow
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-Djava.io.tmpdir=${ES_TMPDIR}
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/data/els
-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log
8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:/var/log/elasticsearch/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m
9-:-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
9-:-Djava.locale.providers=COMPAT
10-:-XX:UseAVX=2

So increasing heap size didn't help, then I tried to use G1GC instead of CMS just like this:

8:-XX:+UseG1GC

and then also with these:

8:-XX:-UseConcMarkSweepGC
8:-XX:-UseCMSInitiatingOccupancyOnly
8:-XX:+UseG1GC
8:-XX:InitiatingHeapOccupancyPercent=30
8:-XX:G1ReservePercent=25

After restarting the node the parent breaker haven't appearead for a few hours or a whole day (I didn't get much further) in the logstash logs, but let me show you some screenshots from Stack Monitoring (with 1 hour period) comparing two hot nodes with the same configuration, first two screenshots are from hot node 1, where the problem doesn't occur:

where you can see, that the heap doesn't get over approx. 7gb and there are none old garbage collections.

These two screenshots are for the problematic hot node 2, where there are peaks of heap memory and they drop down after GC takes care of some old collections (similar behaviour with CMS and G1GC):

Every 5-8 minutes also comes something like this into elasticsearch logs for the problematic node:

2021-04-21T15:52:27,349][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][7055] overhead, spent [1.3s] collecting in the last [1.7s]

Another thing that is strange is that max heap for the problematic node changes even when Xms and Xmx is specified - went a bit more stable when i deleted these two settings from jvm.options, so dynamic heap should be used now:

How could I find out what exactly could be causing all this and how to deal with it?

Thanks in advance for any advice.

radovan · April 21, 2021, 4:43pm

After posting this I found out, that I've got also gc logs (sorry, still much to learn) and they show a lot of these, if somebody can help or explain what should be done?

[2021-04-21T16:40:19.929+0000][178751][gc,age         ] GC(2985) Desired survivor size 278921216 bytes, new threshold 1 (max threshold 15)
[2021-04-21T16:40:19.929+0000][178751][gc,heap        ] GC(2985) PSYoungGen: 3013695K(3120640K)->195157K(3136000K) Eden: 2832896K(2832896K)->0K(2855936K) From: 180799K(287744K)->195157K(280064K)
[2021-04-21T16:40:19.929+0000][178751][gc,heap        ] GC(2985) ParOldGen: 6135779K(6646784K)->6213989K(6646784K)
[2021-04-21T16:40:19.929+0000][178751][gc,metaspace   ] GC(2985) Metaspace: 123304K(128216K)->123304K(128216K) NonClass: 108434K(111832K)->108434K(111832K) Class: 14870K(16384K)->14870K(16384K)
[2021-04-21T16:40:19.929+0000][178751][gc             ] GC(2985) Pause Young (Allocation Failure) 8935M->6258M(9553M) 51.998ms
[2021-04-21T16:40:19.929+0000][178751][gc,cpu         ] GC(2985) User=0.28s Sys=0.01s Real=0.05s
[2021-04-21T16:40:19.929+0000][178751][safepoint      ] Safepoint "ParallelGCFailedAllocation", Time since last: 4526281780 ns, Reaching safepoint: 505606 ns, At safepoint: 52141221 ns, Total: 52646827 ns

warkolm · April 21, 2021, 11:18pm

What is the output from the _cluster/stats?pretty&human API?
What is pushing data into Elasticsearch?

radovan · April 22, 2021, 7:29am

Cluster stats:

{
  "_nodes" : {
    "total" : 19,
    "successful" : 19,
    "failed" : 0
  },
  "cluster_name" : "BC-ELS-production",
  "cluster_uuid" : "WmRIg8hIR0yyXqLxAk7G1w",
  "timestamp" : 1619076138197,
  "status" : "green",
  "indices" : {
    "count" : 733,
    "shards" : {
      "total" : 2779,
      "primaries" : 1393,
      "replication" : 0.9949748743718593,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 10,
          "avg" : 3.791268758526603
        },
        "primaries" : {
          "min" : 1,
          "max" : 5,
          "avg" : 1.9004092769440655
        },
        "replication" : {
          "min" : 0.0,
          "max" : 1.0,
          "avg" : 0.9931787175989086
        }
      }
    },
    "docs" : {
      "count" : 9008633015,
      "deleted" : 6601973
    },
    "store" : {
      "size" : "12tb",
      "size_in_bytes" : 13233483885307,
      "reserved" : "0b",
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size" : "60.1mb",
      "memory_size_in_bytes" : 63102608,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "4.7gb",
      "memory_size_in_bytes" : 5061646484,
      "total_count" : 648297020,
      "hit_count" : 63363812,
      "miss_count" : 584933208,
      "cache_size" : 124024,
      "cache_count" : 637513,
      "evictions" : 513489
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 16769,
      "memory" : "746.3mb",
      "memory_in_bytes" : 782635708,
      "terms_memory" : "523.9mb",
      "terms_memory_in_bytes" : 549360112,
      "stored_fields_memory" : "16.4mb",
      "stored_fields_memory_in_bytes" : 17231384,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "1.2mb",
      "norms_memory_in_bytes" : 1279680,
      "points_memory" : "0b",
      "points_memory_in_bytes" : 0,
      "doc_values_memory" : "204.8mb",
      "doc_values_memory_in_bytes" : 214764532,
      "index_writer_memory" : "880.4mb",
      "index_writer_memory_in_bytes" : 923198658,
      "version_map_memory" : "1.7mb",
      "version_map_memory_in_bytes" : 1844519,
      "fixed_bit_set" : "535.7mb",
      "fixed_bit_set_memory_in_bytes" : 561816240,
      "max_unsafe_auto_id_timestamp" : 1619068256020,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "alias",
          "count" : 7168,
          "index_count" : 218
        },
        {
          "name" : "binary",
          "count" : 27,
          "index_count" : 12
        },
        {
          "name" : "boolean",
          "count" : 26506,
          "index_count" : 540
        },
        {
          "name" : "byte",
          "count" : 14,
          "index_count" : 14
        },
        {
          "name" : "constant_keyword",
          "count" : 2,
          "index_count" : 1
        },
        {
          "name" : "date",
          "count" : 28788,
          "index_count" : 685
        },
        {
          "name" : "date_nanos",
          "count" : 7,
          "index_count" : 7
        },
        {
          "name" : "date_range",
          "count" : 7,
          "index_count" : 7
        },
        {
          "name" : "double",
          "count" : 6092,
          "index_count" : 243
        },
        {
          "name" : "double_range",
          "count" : 7,
          "index_count" : 7
        },
        {
          "name" : "flattened",
          "count" : 662,
          "index_count" : 80
        },
        {
          "name" : "float",
          "count" : 8488,
          "index_count" : 487
        },
        {
          "name" : "float_range",
          "count" : 7,
          "index_count" : 7
        },
        {
          "name" : "geo_point",
          "count" : 3649,
          "index_count" : 580
        },
        {
          "name" : "geo_shape",
          "count" : 13,
          "index_count" : 13
        },
        {
          "name" : "half_float",
          "count" : 272,
          "index_count" : 122
        },
        {
          "name" : "integer",
          "count" : 726,
          "index_count" : 207
        },
        {
          "name" : "integer_range",
          "count" : 7,
          "index_count" : 7
        },
        {
          "name" : "ip",
          "count" : 26411,
          "index_count" : 588
        },
        {
          "name" : "ip_range",
          "count" : 7,
          "index_count" : 7
        },
        {
          "name" : "keyword",
          "count" : 753638,
          "index_count" : 683
        },
        {
          "name" : "long",
          "count" : 238175,
          "index_count" : 662
        },
        {
          "name" : "long_range",
          "count" : 7,
          "index_count" : 7
        },
        {
          "name" : "nested",
          "count" : 445,
          "index_count" : 246
        },
        {
          "name" : "object",
          "count" : 191727,
          "index_count" : 680
        },
        {
          "name" : "scaled_float",
          "count" : 784,
          "index_count" : 7
        },
        {
          "name" : "shape",
          "count" : 7,
          "index_count" : 7
        },
        {
          "name" : "short",
          "count" : 21348,
          "index_count" : 225
        },
        {
          "name" : "text",
          "count" : 44496,
          "index_count" : 657
        },
        {
          "name" : "wildcard",
          "count" : 6,
          "index_count" : 6
        }
      ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [
        {
          "name" : "char_group",
          "count" : 7,
          "index_count" : 7
        }
      ],
      "filter_types" : [ ],
      "analyzer_types" : [
        {
          "name" : "custom",
          "count" : 7,
          "index_count" : 7
        }
      ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [ ],
      "built_in_filters" : [ ],
      "built_in_analyzers" : [ ]
    },
    "versions" : [
      {
        "version" : "6.5.4",
        "index_count" : 11,
        "primary_shard_count" : 31,
        "total_primary_size" : "9.4gb",
        "total_primary_bytes" : 10111205490
      },
      {
        "version" : "6.6.2",
        "index_count" : 1,
        "primary_shard_count" : 1,
        "total_primary_size" : "2.6mb",
        "total_primary_bytes" : 2812687
      },
      {
        "version" : "6.8.0",
        "index_count" : 5,
        "primary_shard_count" : 5,
        "total_primary_size" : "6.1mb",
        "total_primary_bytes" : 6477841
      },
      {
        "version" : "7.4.2",
        "index_count" : 11,
        "primary_shard_count" : 11,
        "total_primary_size" : "16.5mb",
        "total_primary_bytes" : 17315326
      },
      {
        "version" : "7.6.0",
        "index_count" : 5,
        "primary_shard_count" : 5,
        "total_primary_size" : "336.2mb",
        "total_primary_bytes" : 352619901
      },
      {
        "version" : "7.6.1",
        "index_count" : 4,
        "primary_shard_count" : 4,
        "total_primary_size" : "29.2mb",
        "total_primary_bytes" : 30661253
      },
      {
        "version" : "7.7.0",
        "index_count" : 9,
        "primary_shard_count" : 9,
        "total_primary_size" : "291.1mb",
        "total_primary_bytes" : 305327591
      },
      {
        "version" : "7.8.0",
        "index_count" : 11,
        "primary_shard_count" : 11,
        "total_primary_size" : "1.6gb",
        "total_primary_bytes" : 1747748593
      },
      {
        "version" : "7.8.1",
        "index_count" : 3,
        "primary_shard_count" : 3,
        "total_primary_size" : "277.9mb",
        "total_primary_bytes" : 291406063
      },
      {
        "version" : "7.9.0",
        "index_count" : 281,
        "primary_shard_count" : 589,
        "total_primary_size" : "2.5tb",
        "total_primary_bytes" : 2821912536356
      },
      {
        "version" : "7.10.2",
        "index_count" : 157,
        "primary_shard_count" : 291,
        "total_primary_size" : "1.2tb",
        "total_primary_bytes" : 1421334823052
      },
      {
        "version" : "7.11.1",
        "index_count" : 60,
        "primary_shard_count" : 112,
        "total_primary_size" : "603.5gb",
        "total_primary_bytes" : 648095327538
      },
      {
        "version" : "7.11.2",
        "index_count" : 50,
        "primary_shard_count" : 90,
        "total_primary_size" : "608.8gb",
        "total_primary_bytes" : 653758419940
      },
      {
        "version" : "7.12.0",
        "index_count" : 125,
        "primary_shard_count" : 231,
        "total_primary_size" : "985.6gb",
        "total_primary_bytes" : 1058356704134
      }
    ]
  },
  "nodes" : {
    "count" : {
      "total" : 19,
      "coordinating_only" : 0,
      "data" : 0,
      "data_cold" : 3,
      "data_content" : 2,
      "data_frozen" : 0,
      "data_hot" : 3,
      "data_warm" : 4,
      "ingest" : 2,
      "master" : 3,
      "ml" : 0,
      "remote_cluster_client" : 2,
      "transform" : 0,
      "voting_only" : 0
    },
    "versions" : [
      "7.12.0"
    ],
    "os" : {
      "available_processors" : 102,
      "allocated_processors" : 102,
      "names" : [
        {
          "name" : "Linux",
          "count" : 19
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Ubuntu 18.04.5 LTS",
          "count" : 13
        },
        {
          "pretty_name" : "Ubuntu 16.04.7 LTS",
          "count" : 6
        }
      ],
      "architectures" : [
        {
          "arch" : "amd64",
          "count" : 19
        }
      ],
      "mem" : {
        "total" : "183.7gb",
        "total_in_bytes" : 197260746752,
        "free" : "6gb",
        "free_in_bytes" : 6506188800,
        "used" : "177.6gb",
        "used_in_bytes" : 190754557952,
        "free_percent" : 3,
        "used_percent" : 97
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 226
      },
      "open_file_descriptors" : {
        "min" : 715,
        "max" : 2355,
        "avg" : 1602
      }
    },
    "jvm" : {
      "max_uptime" : "22.7d",
      "max_uptime_in_millis" : 1968751670,
      "versions" : [
        {
          "version" : "15.0.1",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "15.0.1+9",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 19
        }
      ],
      "mem" : {
        "heap_used" : "54.3gb",
        "heap_used_in_bytes" : 58325042592,
        "heap_max" : "93.4gb",
        "heap_max_in_bytes" : 100381753344
      },
      "threads" : 1615
    },
    "fs" : {
      "total" : "20.2tb",
      "total_in_bytes" : 22291266891776,
      "free" : "8.1tb",
      "free_in_bytes" : 8944382054400,
      "available" : "7.1tb",
      "available_in_bytes" : 7844371730432
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 19
      },
      "http_types" : {
        "security4" : 19
      }
    },
    "discovery_types" : {
      "zen" : 19
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "deb",
        "count" : 19
      }
    ],

radovan · April 22, 2021, 7:30am

(had to split the output due to char limit)

"ingest" : {
      "number_of_pipelines" : 72,
      "processor_stats" : {
        "append" : {
          "count" : 80857370,
          "failed" : 0,
          "current" : 0,
          "time" : "6.3m",
          "time_in_millis" : 382589
        },
        "conditional" : {
          "count" : 3083340784,
          "failed" : 0,
          "current" : 0,
          "time" : "3.1d",
          "time_in_millis" : 269357188
        },
        "convert" : {
          "count" : 346124167,
          "failed" : 0,
          "current" : 0,
          "time" : "3.1h",
          "time_in_millis" : 11179549
        },
        "date" : {
          "count" : 607581106,
          "failed" : 358536183,
          "current" : 0,
          "time" : "6.6h",
          "time_in_millis" : 23768998
        },
        "foreach" : {
          "count" : 34553239,
          "failed" : 0,
          "current" : 0,
          "time" : "19.6m",
          "time_in_millis" : 1179829
        },
        "geoip" : {
          "count" : 1442386224,
          "failed" : 389265164,
          "current" : 0,
          "time" : "23.7h",
          "time_in_millis" : 85667748
        },
        "grok" : {
          "count" : 876927559,
          "failed" : 1362773,
          "current" : 0,
          "time" : "4.7d",
          "time_in_millis" : 412095079
        },
        "gsub" : {
          "count" : 133412739,
          "failed" : 0,
          "current" : 0,
          "time" : "1h",
          "time_in_millis" : 3673523
        },
        "join" : {
          "count" : 44470913,
          "failed" : 41619776,
          "current" : 0,
          "time" : "21.8m",
          "time_in_millis" : 1309972
        },
        "kv" : {
          "count" : 94255908,
          "failed" : 11058,
          "current" : 0,
          "time" : "35m",
          "time_in_millis" : 2104149
        },
        "lowercase" : {
          "count" : 1147094874,
          "failed" : 0,
          "current" : 0,
          "time" : "4.8h",
          "time_in_millis" : 17426767
        },
        "remove" : {
          "count" : 1949364369,
          "failed" : 446110747,
          "current" : 0,
          "time" : "9.1h",
          "time_in_millis" : 32981764
        },
        "rename" : {
          "count" : 6417873891,
          "failed" : 61033546,
          "current" : 0,
          "time" : "11.3h",
          "time_in_millis" : 40817209
        },
        "script" : {
          "count" : 1494881311,
          "failed" : 388811029,
          "current" : 0,
          "time" : "1.7d",
          "time_in_millis" : 149340170
        },
        "set" : {
          "count" : 3159003019,
          "failed" : 0,
          "current" : 0,
          "time" : "1d",
          "time_in_millis" : 89134088
        },
        "split" : {
          "count" : 126741340,
          "failed" : 2656007,
          "current" : 0,
          "time" : "43.6m",
          "time_in_millis" : 2619729
        },
        "uppercase" : {
          "count" : 44470913,
          "failed" : 0,
          "current" : 0,
          "time" : "22.2m",
          "time_in_millis" : 1335342
        },
        "user_agent" : {
          "count" : 329311425,
          "failed" : 14382917,
          "current" : 0,
          "time" : "2.6h",
          "time_in_millis" : 9371588
        }
      }
    }
  }
}

Only logstash machines (3) are pushing data into Elasticsearch. And maybe internal monitoring (for Stack Monitoring).

warkolm · April 22, 2021, 9:15am

Thanks for that!

You should look at increasing the size of your shards. It's going to be wasting resources given the average size is about 4GB.

radovan:

"pretty_names" : [
        {
          "pretty_name" : "Ubuntu 18.04.5 LTS",
          "count" : 13
        },
        {
          "pretty_name" : "Ubuntu 16.04.7 LTS",
          "count" : 6
        }
      ],

You should really also be running the same OS. That's a big difference!

You posted this earlier, but can you post the log from before and after it as well?

radovan · April 22, 2021, 10:06am

How to accomplish that? We're using ILM policy that is mostly setup for 30GB/30d, but I admit, that there might be some indices, that even after 30 days may not hit the appropriate size, so should I just increase the time threshold for rollover?

I'm planning to do that, but got plenty of other tasks, so didn't have time for that. At the moment there is 1 management node, 2 ingest nodes, 2 warm nodes and 1 controller node on ubuntu 16, so could this be a reason for something regarding this problem?

So after the nodes is started and stuff, I get this (before 15:52)

[2021-04-21T13:59:20,073][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][284] overhead, spent [1.6s] collecting in the last [2.1s]
[2021-04-21T14:02:29,354][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][472] overhead, spent [1.1s] collecting in the last [1.9s]
[2021-04-21T14:06:45,602][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][727] overhead, spent [997ms] collecting in the last [1.9s]
[2021-04-21T14:11:30,561][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][1011] overhead, spent [1s] collecting in the last [1.5s]
[2021-04-21T14:17:13,341][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][1353] overhead, spent [853ms] collecting in the last [1.5s]
[2021-04-21T14:22:52,324][INFO ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][1691] overhead, spent [765ms] collecting in the last [1.7s]
[2021-04-21T14:28:27,182][INFO ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][2025] overhead, spent [734ms] collecting in the last [1.5s]
[2021-04-21T14:33:37,814][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][2335] overhead, spent [1s] collecting in the last [1.4s]
[2021-04-21T14:39:11,716][INFO ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][2668] overhead, spent [756ms] collecting in the last [1.6s]
[2021-04-21T14:45:29,261][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][3045] overhead, spent [772ms] collecting in the last [1.1s]
[2021-04-21T14:51:23,210][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][3398] overhead, spent [1s] collecting in the last [1.6s]
[2021-04-21T14:58:35,684][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][3830] overhead, spent [893ms] collecting in the last [1.1s]
[2021-04-21T15:04:00,199][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][4154] overhead, spent [758ms] collecting in the last [1.2s]
[2021-04-21T15:09:58,597][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][4511] overhead, spent [1s] collecting in the last [1.1s]
[2021-04-21T15:15:42,056][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][4853] overhead, spent [977ms] collecting in the last [1.4s]
[2021-04-21T15:23:03,655][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][5294] overhead, spent [775ms] collecting in the last [1.1s]
[2021-04-21T15:30:02,092][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][5712] overhead, spent [829ms] collecting in the last [1.1s]
[2021-04-21T15:38:01,774][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][6191] overhead, spent [786ms] collecting in the last [1.3s]
[2021-04-21T15:44:41,215][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][6590] overhead, spent [715ms] collecting in the last [1.2s]

and after:

[2021-04-21T15:58:42,608][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][7430] overhead, spent [950ms] collecting in the last [1s]
[2021-04-21T16:05:37,172][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][7844] overhead, spent [758ms] collecting in the last [1.2s]
[2021-04-21T16:12:15,437][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][8242] overhead, spent [861ms] collecting in the last [1s]
[2021-04-21T16:19:28,928][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][8675] overhead, spent [763ms] collecting in the last [1.2s]
[2021-04-21T16:26:43,825][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][9109] overhead, spent [867ms] collecting in the last [1.6s]
[2021-04-21T16:34:39,745][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][9584] overhead, spent [920ms] collecting in the last [1.3s]
[2021-04-21T16:43:17,913][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][10101] overhead, spent [790ms] collecting in the last [1.5s]
[2021-04-21T16:50:34,412][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][10537] overhead, spent [799ms] collecting in the last [1s]
[2021-04-21T16:58:26,258][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][11008] overhead, spent [870ms] collecting in the last [1.4s]
[2021-04-21T17:06:00,269][INFO ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][11461] overhead, spent [791ms] collecting in the last [1.6s]
[2021-04-21T17:15:00,783][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][12001] overhead, spent [756ms] collecting in the last [1.2s]
[2021-04-21T17:23:42,235][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][12522] overhead, spent [904ms] collecting in the last [1.1s]
[2021-04-21T17:32:09,492][INFO ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][13028] overhead, spent [929ms] collecting in the last [1.9s]
[2021-04-21T17:40:46,274][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][13544] overhead, spent [847ms] collecting in the last [1.4s]
[2021-04-21T17:49:30,257][INFO ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][14067] overhead, spent [738ms] collecting in the last [1.6s]
[2021-04-21T17:57:14,573][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][14531] overhead, spent [965ms] collecting in the last [1s]
[2021-04-21T18:06:07,932][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][15064] overhead, spent [853ms] collecting in the last [1s]
[2021-04-21T18:14:49,579][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][15585] overhead, spent [785ms] collecting in the last [1.3s]
[2021-04-21T18:22:50,931][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][16066] overhead, spent [870ms] collecting in the last [1s]
[2021-04-21T18:32:03,544][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][16618] overhead, spent [797ms] collecting in the last [1.1s]
[2021-04-21T18:41:16,223][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][17170] overhead, spent [1.3s] collecting in the last [1.3s]
[2021-04-21T18:49:57,755][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][17691] overhead, spent [844ms] collecting in the last [1.1s]
[2021-04-21T18:59:11,410][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][18244] overhead, spent [1.1s] collecting in the last [1.3s]
[2021-04-21T19:07:28,354][INFO ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][18740] overhead, spent [815ms] collecting in the last [1.6s]
[2021-04-21T19:17:00,667][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][19312] overhead, spent [830ms] collecting in the last [1s]
[2021-04-21T19:25:09,849][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][19800] overhead, spent [928ms] collecting in the last [1.8s]
[2021-04-21T19:34:07,634][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][20337] overhead, spent [741ms] collecting in the last [1.4s]
[2021-04-21T19:42:49,586][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][20858] overhead, spent [874ms] collecting in the last [1s]
[2021-04-21T19:53:00,353][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][21468] overhead, spent [788ms] collecting in the last [1.4s]
[2021-04-21T20:02:20,306][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][22027] overhead, spent [889ms] collecting in the last [1.5s]
[2021-04-21T20:11:55,080][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][22601] overhead, spent [866ms] collecting in the last [1.3s]
[2021-04-21T20:20:54,715][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][23140] overhead, spent [758ms] collecting in the last [1.2s]
[2021-04-21T20:30:13,298][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][23698] overhead, spent [855ms] collecting in the last [1.3s]
[2021-04-21T20:39:44,814][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][24269] overhead, spent [749ms] collecting in the last [1.1s]
[2021-04-21T20:50:49,599][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][24933] overhead, spent [808ms] collecting in the last [1.4s]
[2021-04-21T21:01:45,564][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][25588] overhead, spent [899ms] collecting in the last [1.5s]
[2021-04-21T21:12:37,411][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][26239] overhead, spent [764ms] collecting in the last [1.5s]
[2021-04-21T21:23:09,456][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][26870] overhead, spent [894ms] collecting in the last [1.7s]
[2021-04-21T21:32:44,086][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][27444] overhead, spent [848ms] collecting in the last [1.3s]
[2021-04-21T21:42:22,589][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][28022] overhead, spent [1s] collecting in the last [1.1s]
[2021-04-21T21:53:05,801][INFO ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][28664] overhead, spent [942ms] collecting in the last [1.9s]
[2021-04-21T22:02:57,004][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][29254] overhead, spent [1s] collecting in the last [1.9s]
[2021-04-21T22:12:45,781][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][29842] overhead, spent [741ms] collecting in the last [1.4s]
[2021-04-21T22:22:19,159][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][30415] overhead, spent [796ms] collecting in the last [1s]
[2021-04-21T22:31:07,741][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][30943] overhead, spent [753ms] collecting in the last [1.2s]
[2021-04-21T22:40:26,108][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][31501] overhead, spent [857ms] collecting in the last [1s]
[2021-04-21T22:50:16,492][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][32091] overhead, spent [741ms] collecting in the last [1s]
[2021-04-21T22:59:49,598][INFO ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][32663] overhead, spent [863ms] collecting in the last [1.7s]
[2021-04-21T23:08:45,452][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][33198] overhead, spent [933ms] collecting in the last [1.6s]
[2021-04-21T23:17:12,296][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][33704] overhead, spent [938ms] collecting in the last [1.6s]
[2021-04-21T23:25:40,886][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][34212] overhead, spent [742ms] collecting in the last [1s]
[2021-04-21T23:34:23,226][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][34734] overhead, spent [782ms] collecting in the last [1s]
[2021-04-21T23:41:37,339][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][35167] overhead, spent [948ms] collecting in the last [1.7s]
[2021-04-21T23:50:17,972][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][35687] overhead, spent [946ms] collecting in the last [1.2s]
[2021-04-21T23:58:00,412][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][36149] overhead, spent [773ms] collecting in the last [1.1s]

radovan · April 23, 2021, 9:31am

So I found some empty indices, that were creating constantly because of uploading new index templates (seems like it uploads also an corresponding ILM policy) when there was a new beats release -> got from 118 indices and 155 shards to 90 indices and 113 shards on this hot node. The heap improved (not right after, but during the night) A LOT:

Although this still comes into log pretty often:

[2021-04-23T09:39:09,683][INFO ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][63857] overhead, spent [649ms] collecting in the last [1.3s]
[2021-04-23T09:42:45,069][INFO ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][64072] overhead, spent [596ms] collecting in the last [1.2s]
[2021-04-23T09:46:16,764][INFO ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][64283] overhead, spent [640ms] collecting in the last [1.6s]
[2021-04-23T09:50:22,891][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][64529] overhead, spent [646ms] collecting in the last [1s]
[2021-04-23T09:54:15,516][INFO ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][64761] overhead, spent [583ms] collecting in the last [1.5s]
[2021-04-23T09:57:40,133][INFO ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][64965] overhead, spent [754ms] collecting in the last [1.5s]
[2021-04-23T10:01:34,320][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][65199] overhead, spent [611ms] collecting in the last [1s]
[2021-04-23T10:04:56,633][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][65401] overhead, spent [676ms] collecting in the last [1.2s]
[2021-04-23T10:09:04,730][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][65649] overhead, spent [679ms] collecting in the last [1s]
[2021-04-23T10:12:31,219][INFO ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][65855] overhead, spent [575ms] collecting in the last [1.4s]
[2021-04-23T10:16:01,359][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][66065] overhead, spent [585ms] collecting in the last [1s]
[2021-04-23T10:19:24,779][INFO ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][66268] overhead, spent [678ms] collecting in the last [1.3s]
[2021-04-23T10:21:54,826][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][66418] overhead, spent [598ms] collecting in the last [1s]
[2021-04-23T10:24:29,420][INFO ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][66572] overhead, spent [566ms] collecting in the last [1.5s]
[2021-04-23T10:27:31,773][INFO ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][66754] overhead, spent [598ms] collecting in the last [1.2s]
[2021-04-23T10:30:09,847][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][66912] overhead, spent [669ms] collecting in the last [1s]
[2021-04-23T10:32:51,899][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][67074] overhead, spent [667ms] collecting in the last [1s]
[2021-04-23T10:35:44,957][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][67247] overhead, spent [763ms] collecting in the last [1s]
[2021-04-23T10:39:19,041][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][67461] overhead, spent [830ms] collecting in the last [1s]
[2021-04-23T10:42:32,189][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][67654] overhead, spent [579ms] collecting in the last [1s]
[2021-04-23T10:45:48,261][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][67850] overhead, spent [643ms] collecting in the last [1s]
[2021-04-23T10:49:13,325][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][68055] overhead, spent [624ms] collecting in the last [1s]
[2021-04-23T10:51:49,554][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][68211] overhead, spent [665ms] collecting in the last [1.1s]
[2021-04-23T10:54:51,660][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][68393] overhead, spent [605ms] collecting in the last [1s]
[2021-04-23T10:57:31,867][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][68553] overhead, spent [607ms] collecting in the last [1.1s]
[2021-04-23T11:00:41,237][INFO ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][68742] overhead, spent [567ms] collecting in the last [1.3s]
[2021-04-23T11:03:05,746][INFO ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][68886] overhead, spent [586ms] collecting in the last [1.4s]
[2021-04-23T11:05:14,032][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][69014] overhead, spent [662ms] collecting in the last [1.2s]
[2021-04-23T11:07:52,242][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][69172] overhead, spent [756ms] collecting in the last [1.1s]
[2021-04-23T11:10:49,361][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][69349] overhead, spent [745ms] collecting in the last [1s]
[2021-04-23T11:13:01,609][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][69481] overhead, spent [670ms] collecting in the last [1.1s]
[2021-04-23T11:16:56,710][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][69716] overhead, spent [617ms] collecting in the last [1s]
[2021-04-23T11:20:36,895][WARN ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-02] [gc][69936] overhead, spent [646ms] collecting in the last [1.1s]

In the meanwhile this is on the first hot node, where it's almost like once per hour:

[2021-04-23T05:43:51,466][INFO ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-01] [gc][575160] overhead, spent [406ms] collecting in the last [1.3s]
[2021-04-23T06:30:19,073][INFO ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-01] [gc][577944] overhead, spent [279ms] collecting in the last [1s]
[2021-04-23T07:32:35,618][INFO ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-01] [gc][581675] overhead, spent [304ms] collecting in the last [1.1s]
[2021-04-23T08:46:19,006][INFO ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-01] [gc][586091] overhead, spent [329ms] collecting in the last [1s]
[2021-04-23T09:05:21,383][INFO ][o.e.m.j.JvmGcMonitorService] [ELK-D-H-01] [gc][587232] overhead, spent [257ms] collecting in the last [1s]

And still don't know, why the problematic node doesn't take in account that I set up Xms and Xmx in the jvm.options and he still acts like he had dynamic max heap.

And another thing I don't know is why these old collections appear only on the problematic node:

radovan · April 26, 2021, 10:49am

Today at 10AM the heap started increasing again..

And here it is for last 3 hours where you can see also how the max heap started scaling even tho I specified Xms and Xmx the same as for other hosts:

warkolm · April 28, 2021, 1:19am

Can you post the elasticsearch.yml and the entire log from startup for that node? We want to make sure those settings are being applied correctly.

radovan · April 28, 2021, 7:39am

Seems like the solution was to use these jvm settings:

10-:-XX:+UseG1GC
10-:-XX:G1ReservePercent=25
10-:-XX:InitiatingHeapOccupancyPercent=30
10-:-XX:+UseStringDeduplication
10-:-XX:G1HeapRegionSize=16M

instead of these:

8-13:-XX:+UseConcMarkSweepGC
8-13:-XX:CMSInitiatingOccupancyFraction=65
8-13:-XX:+UseCMSInitiatingOccupancyOnly

and commenting

ES_JAVA_OPTS=-XX:+UseParallelGC

in /etc/default/elasticsearch. Even the max heap size is now the actual value, that I set up for it in jvm.options. The problem was that I didn't know that elastic doesn't use system java, but the bundled java that comes with the ubuntu package, so the java version wasn't 8 but 15 and that's why elasticsearch didn't consider the settings i put there even tho he could read them.

system · May 26, 2021, 7:39am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.