Overhead overspend issues

I'm having huge problems getting my 3-node ES setup to stay up for more than an hour. Nothing had changed leading up to this, although I was seeing issues with the JVM running out of heap space. I've since upgraded successfully from 6.0.0 to 6.2.1.
The problem persists, and my logs are full of entries like:
[2018-02-12T16:19:03,414][WARN ][o.e.m.j.JvmGcMonitorService] [es-server-01] [gc][585] overhead, spent [10.5s] collecting in the last [10.7s]
and the services across all 3 ES nodes eventually fail. As I understand it, that warning means the node is spending essentially all of its time in garbage collection. I've been trying to fix this for the last week or so but am getting nowhere.

Log header

[2018-02-12T16:07:03,701][INFO ][o.e.n.Node               ] [es-server1] initializing ...
[2018-02-12T16:07:03,873][INFO ][o.e.e.NodeEnvironment    ] [es-server1] using [1] data paths, mounts [[DATA (D:)]], net usable_space [79.5gb], net total_space [99.9gb], types [NTFS]
[2018-02-12T16:07:03,873][INFO ][o.e.e.NodeEnvironment    ] [es-server1] heap size [3.9gb], compressed ordinary object pointers [true]
[2018-02-12T16:07:15,757][INFO ][o.e.n.Node               ] [es-server1] node name [gs05-srv227], node ID [2HovUhtrTPqSt13ZPP8w4w]
[2018-02-12T16:07:15,757][INFO ][o.e.n.Node               ] [es-server1] version[6.2.1], pid[692], build[7299dc3/2018-02-07T19:34:26.990113Z], OS[Windows Server 2012 R2/6.3/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_144/25.144-b01]
[2018-02-12T16:07:15,757][INFO ][o.e.n.Node               ] [es-server1] JVM arguments [-Xms4g, -Xmx4g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.io.tmpdir=C:\Temp\elasticsearch, -XX:+HeapDumpOnOutOfMemoryError, -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, -XX:+PrintTenuringDistribution, -XX:+PrintGCApplicationStoppedTime, -Xloggc:logs/gc.log, -XX:+UseGCLogFileRotation, -XX:NumberOfGCLogFiles=32, -XX:GCLogFileSize=64m, -Delasticsearch, -Des.path.home=c:\elasticsearch, -Des.path.conf=c:\elasticsearch\config, exit, -Xms4098m, -Xmx4098m, -Xss1024k]

Steps I've taken

  1. I set the ES_HEAP_SIZE environment variable to a value of 4g before installing the service

  2. After installing the service, I used the service manager to set the Java options to:
    -Xms4g
    -Xmx4g
    -XX:+UseConcMarkSweepGC
    -XX:CMSInitiatingOccupancyFraction=75
    -XX:+UseCMSInitiatingOccupancyOnly
    -XX:+AlwaysPreTouch
    -Xss1m
    -Djava.awt.headless=true
    -Dfile.encoding=UTF-8
    -Djna.nosys=true
    -XX:-OmitStackTraceInFastThrow
    -Dio.netty.noUnsafe=true
    -Dio.netty.noKeySetOptimization=true
    -Dio.netty.recycler.maxCapacityPerThread=0
    -Dlog4j.shutdownHookEnabled=false
    -Dlog4j2.disable.jmx=true
    -Djava.io.tmpdir=C:\Temp\elasticsearch
    -XX:+HeapDumpOnOutOfMemoryError
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintTenuringDistribution
    -XX:+PrintGCApplicationStoppedTime
    -Xloggc:logs/gc.log
    -XX:+UseGCLogFileRotation
    -XX:NumberOfGCLogFiles=32
    -XX:GCLogFileSize=64m
    -Delasticsearch
    -Des.path.home=c:\elasticsearch
    -Des.path.conf=c:\elasticsearch\config

  3. Set the initial memory pool to 4098 MB, the maximum memory pool to 4098 MB and the thread stack size to 1024 KB (see the verification calls after this list)

  4. Set elasticsearch.yml to contain: bootstrap.memory_lock: true
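
A note on verifying the above: the startup log shows the heap flags twice (-Xms4g/-Xmx4g from the options list and, later, -Xms4098m/-Xmx4098m, presumably from the memory pool fields in step 3), so it's worth confirming what the JVM actually picked up on each node. Something like the first call below should show the live heap and input arguments per node, and the second should confirm whether memory_lock took effect (localhost is a placeholder for each node's address):

    curl -XGET "http://localhost:9200/_nodes/jvm?human&pretty&filter_path=nodes.*.name,nodes.*.jvm.mem.heap_max,nodes.*.jvm.input_arguments"
    curl -XGET "http://localhost:9200/_nodes?pretty&filter_path=**.mlockall"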

Server specifics
Windows 2012 R2
8GB RAM
ES (Elasticsearch): 6.2.1
LS (Logstash): 6.2.1
KB (Kibana): 6.2.1

Any suggestions welcome!

Can you provide us some statistics from the cluster by running the cluster stats API?
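Something along these lines should do it (adjust the host/port for one of your nodes):

    curl -XGET "http://localhost:9200/_cluster/stats?human&pretty"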

Now that provides some interesting info...
Why is my heap max higher than the available memory I have in the server?

{
  "_nodes" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "cluster_name" : "gs-elasticsearch",
  "timestamp" : 1518454797203,
  "status" : "yellow",
  "indices" : {
    "count" : 175,
    "shards" : {
      "total" : 871,
      "primaries" : 871,
      "replication" : 0.0,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 5,
          "avg" : 4.977142857142857
        },
        "primaries" : {
          "min" : 1,
          "max" : 5,
          "avg" : 4.977142857142857
        },
        "replication" : {
          "min" : 0.0,
          "max" : 0.0,
          "avg" : 0.0
        }
      }
    },
    "docs" : {
      "count" : 40184660,
      "deleted" : 111
    },
    "store" : {
      "size" : "27.7gb",
      "size_in_bytes" : 29793940379
    },
    "fielddata" : {
      "memory_size" : "5.9mb",
      "memory_size_in_bytes" : 6261448,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "845.5kb",
      "memory_size_in_bytes" : 865889,
      "total_count" : 2320,
      "hit_count" : 420,
      "miss_count" : 1900,
      "cache_size" : 52,
      "cache_count" : 52,
      "evictions" : 0
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 4543,
      "memory" : "127.8mb",
      "memory_in_bytes" : 134092950,
      "terms_memory" : "99.9mb",
      "terms_memory_in_bytes" : 104776149,
      "stored_fields_memory" : "12mb",
      "stored_fields_memory_in_bytes" : 12621136,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "7.9mb",
      "norms_memory_in_bytes" : 8316864,
      "points_memory" : "3.1mb",
      "points_memory_in_bytes" : 3348269,
      "doc_values_memory" : "4.7mb",
      "doc_values_memory_in_bytes" : 5030532,
      "index_writer_memory" : "16.7mb",
      "index_writer_memory_in_bytes" : 17535256,
      "version_map_memory" : "0b",
      "version_map_memory_in_bytes" : 0,
      "fixed_bit_set" : "0b",
      "fixed_bit_set_memory_in_bytes" : 0,
      "max_unsafe_auto_id_timestamp" : 1518452370398,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 3,
      "data" : 3,
      "coordinating_only" : 0,
      "master" : 3,
      "ingest" : 3
    },
    "versions" : [
      "6.2.1"
    ],
    "os" : {
      "available_processors" : 12,
      "allocated_processors" : 12,
      "names" : [
        {
          "name" : "Windows Server 2012 R2",
          "count" : 3
        }
      ],
      "mem" : {
        "total" : "35.9gb",
        "total_in_bytes" : 38652887040,
        "free" : "18gb",
        "free_in_bytes" : 19374075904,
        "used" : "17.9gb",
        "used_in_bytes" : 19278811136,
        "free_percent" : 50,
        "used_percent" : 50
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 48
      },
      "open_file_descriptors" : {
        "min" : -1,
        "max" : -1,
        "avg" : 0
      }
    },
    "jvm" : {
      "max_uptime" : "52.9m",
      "max_uptime_in_millis" : 3176109,
      "versions" : [
        {
          "version" : "1.8.0_144",
          "vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
          "vm_version" : "25.144-b01",
          "vm_vendor" : "Oracle Corporation",
          "count" : 3
        }
      ],
      "mem" : {
        "heap_used" : "1.9gb",
        "heap_used_in_bytes" : 2044156432,
       ** "heap_max" : "11.9gb",**
        "heap_max_in_bytes" : 12786597888
      },
      "threads" : 330
    },
    "fs" : {
      "total" : "299.9gb",
      "total_in_bytes" : 322113097728,
      "free" : "234.4gb",
      "free_in_bytes" : 251707219968,
      "available" : "234.4gb",
      "available_in_bytes" : 251707219968
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "netty4" : 3
      },
      "http_types" : {
        "netty4" : 3
      }
    }
  }
}

You said you had 3 nodes with 4GB heap each, right? That is a total of 12GB. Are these not running on separate hosts?
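The cluster stats API sums heap_max across every node, so 3 nodes × ~3.9gb of heap per node is where the 11.9gb figure comes from; it is not the heap on a single server.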

I do. My fault, I was thinking per node. I have 3 ES nodes, 1 KB server and 1 LS server. So is there nothing in there that looks "off" to you?

Additional log info from when the nodes stay up for a while (usually when fewer logs are being thrown at the cluster, I assume):

[2018-02-12T19:17:41,159][INFO ][o.e.m.j.JvmGcMonitorService] [es-srv1] [gc][old][5797][37] duration [8.8s], collections [1]/[9.8s], total [8.8s]/[3.6m], memory [3.4gb]->[2.8gb]/[3.9gb], all_pools {[young] [4.3kb]->[1.9mb]/[266.2mb]}{[survivor] [33.2mb]->[0b]/[33.2mb]}{[old] [3.4gb]->[2.7gb]/[3.6gb]}
[2018-02-12T19:17:41,159][WARN ][o.e.m.j.JvmGcMonitorService] [es-srv1] [gc][5797] overhead, spent [9.2s] collecting in the last [9.8s]
[2018-02-12T19:17:42,174][WARN ][o.e.m.j.JvmGcMonitorService] [es-srv1] [gc][5798] overhead, spent [548ms] collecting in the last [1s]
[2018-02-12T19:17:43,190][WARN ][o.e.m.j.JvmGcMonitorService] [es-srv1] [gc][5799] overhead, spent [557ms] collecting in the last [1s]
[2018-02-12T19:17:44,206][INFO ][o.e.m.j.JvmGcMonitorService] [es-srv1] [gc][5800] overhead, spent [388ms] collecting in the last [1s]
[2018-02-12T19:17:51,065][INFO ][o.e.m.j.JvmGcMonitorService] [es-srv1] [gc][old][5801][38] duration [6.4s], collections [1]/[6.8s], total [6.4s]/[3.7m], memory [3.7gb]->[2.1gb]/[3.9gb], all_pools {[young] [130.8mb]->[4.1mb]/[266.2mb]}{[survivor] [33.2mb]->[0b]/[33.2mb]}{[old] [3.5gb]->[2.1gb]/[3.6gb]}
[2018-02-12T19:17:51,065][WARN ][o.e.m.j.JvmGcMonitorService] [es-srv1] [gc][5801] overhead, spent [6.6s] collecting in the last [6.8s]
[2018-02-12T19:17:52,112][INFO ][o.e.m.j.JvmGcMonitorService] [es-srv1] [gc][5802] overhead, spent [440ms] collecting in the last [1s]

Any ideas? I'm at a point where I can't see any other option but wiping out my whole installation and starting again.

Do you have X-Pack monitoring installed so we can get a picture of how heap usage varies over time?

Sorry Christian, I've just replied to you regarding another post I'd made along the same lines. To save you having to search it out, here's the reply:

So I've used Cerebro to get a better overview of the issue. I can now watch the JVM heap usage on each node slowly increasing, sometimes to 90% on one of the nodes. I can see that I have 42,087,876 docs and 1,782 shards spread across 3 nodes, with a total size of 58GB and each node now having a 4GB JVM heap. Based on the above, can you confirm that the reason I am seeing all these timeouts is down to the sheer number of shards/docs?

If this is the case, what are the steps to reduce them? I only have 179 indices. Is it a config issue that I've screwed up on the initial build?
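
From what I've read, one way to cut the shard count on existing indices would be the _shrink API, along these lines (the index names and target shard count here are just placeholders, and the node name follows the one in the logs above):

    curl -XPUT "http://localhost:9200/my-old-index/_settings" -H "Content-Type: application/json" -d'
    {
      "index.routing.allocation.require._name": "es-server-01",
      "index.blocks.write": true
    }'

    curl -XPOST "http://localhost:9200/my-old-index/_shrink/my-old-index-shrunk" -H "Content-Type: application/json" -d'
    {
      "settings": { "index.number_of_shards": 1, "index.number_of_replicas": 0 }
    }'

(As I understand it, shrinking requires a copy of every shard of the source index on one node and the index blocked for writes, hence the first call.)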

In terms of heap usage over time, I can watch it climb from 32% to well into the 60s or higher in 5 minutes or less in some instances. Having run a Curator cleanup, that usage has decreased, but it'll be up again in the next few hours. There are only 5 indexes that I've created pushing data into this cluster, so I'm slightly confused as to how I'm overloading it!
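
(For reference, I believe the same heap numbers can be spot-checked without Cerebro with something like the call below; localhost is a placeholder for a node address:)

    curl -XGET "http://localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent"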

Thanks.
