GC "overhead, spent ... collecting" issues


#1

I'm having huge problems getting my 3-node ES setup to stay up for more than an hour. Nothing has changed leading up to this, although I was seeing errors about the JVM running out of heap space. I've since upgraded successfully from 6.0.0 to 6.2.1.
The problem persists. My logs are full of entries like:
[2018-02-12T16:19:03,414][WARN ][o.e.m.j.JvmGcMonitorService] [es-server-01] [gc][585] overhead, spent [10.5s] collecting in the last [10.7s]
and the Elasticsearch services on all 3 nodes fail. I've been trying to fix this for the last week or so but am getting nowhere.
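
A quick way to sample heap usage and GC counters on a node while this is happening is the nodes stats API; a minimal example, assuming the default localhost:9200 binding:

    curl -XGET "http://localhost:9200/_nodes/stats/jvm?human&pretty"

The response includes heap_used_percent per node plus cumulative collection counts and times for the young and old GC collectors.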

Log header

[2018-02-12T16:07:03,701][INFO ][o.e.n.Node               ] [es-server1] initializing ...
[2018-02-12T16:07:03,873][INFO ][o.e.e.NodeEnvironment    ] [es-server1] using [1] data paths, mounts [[DATA (D:)]], net usable_space [79.5gb], net total_space [99.9gb], types [NTFS]
[2018-02-12T16:07:03,873][INFO ][o.e.e.NodeEnvironment    ] [es-server1] heap size [3.9gb], compressed ordinary object pointers [true]
[2018-02-12T16:07:15,757][INFO ][o.e.n.Node               ] [es-server1] node name [gs05-srv227], node ID [2HovUhtrTPqSt13ZPP8w4w]
[2018-02-12T16:07:15,757][INFO ][o.e.n.Node               ] [es-server1] version[6.2.1], pid[692], build[7299dc3/2018-02-07T19:34:26.990113Z], OS[Windows Server 2012 R2/6.3/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_144/25.144-b01]
[2018-02-12T16:07:15,757][INFO ][o.e.n.Node               ] [es-server1] JVM arguments [-Xms4g, -Xmx4g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.io.tmpdir=C:\Temp\elasticsearch, -XX:+HeapDumpOnOutOfMemoryError, -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, -XX:+PrintTenuringDistribution, -XX:+PrintGCApplicationStoppedTime, -Xloggc:logs/gc.log, -XX:+UseGCLogFileRotation, -XX:NumberOfGCLogFiles=32, -XX:GCLogFileSize=64m, -Delasticsearch, -Des.path.home=c:\elasticsearch, -Des.path.conf=c:\elasticsearch\config, exit, -Xms4098m, -Xmx4098m, -Xss1024k]

Steps I've taken

  1. Set the ES_HEAP_SIZE environment variable to a value of 4g before installing the service

  2. After installing the service, used the service manager to set the Java options to:
    -Xms4g
    -Xmx4g
    -XX:+UseConcMarkSweepGC
    -XX:CMSInitiatingOccupancyFraction=75
    -XX:+UseCMSInitiatingOccupancyOnly
    -XX:+AlwaysPreTouch
    -Xss1m
    -Djava.awt.headless=true
    -Dfile.encoding=UTF-8
    -Djna.nosys=true
    -XX:-OmitStackTraceInFastThrow
    -Dio.netty.noUnsafe=true
    -Dio.netty.noKeySetOptimization=true
    -Dio.netty.recycler.maxCapacityPerThread=0
    -Dlog4j.shutdownHookEnabled=false
    -Dlog4j2.disable.jmx=true
    -Djava.io.tmpdir=C:\Temp\elasticsearch
    -XX:+HeapDumpOnOutOfMemoryError
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintTenuringDistribution
    -XX:+PrintGCApplicationStoppedTime
    -Xloggc:logs/gc.log
    -XX:+UseGCLogFileRotation
    -XX:NumberOfGCLogFiles=32
    -XX:GCLogFileSize=64m
    -Delasticsearch
    -Des.path.home=c:\elasticsearch
    -Des.path.conf=c:\elasticsearch\config

  3. In the service manager, set the Initial memory pool to 4098 MB, the Maximum memory pool to 4098 MB, and the Thread stack size to 1024 KB

  4. Set elasticsearch.yml to contain: bootstrap.memory_lock: true (a consolidated sketch of these settings follows this list)
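
For reference, here's a consolidated sketch of what steps 1-4 amount to, with the heap set in a single place; the file locations come from the paths in the startup log above and are otherwise an assumption based on a default install:

    # c:\elasticsearch\config\jvm.options
    -Xms4g
    -Xmx4g

    # c:\elasticsearch\config\elasticsearch.yml
    bootstrap.memory_lock: true

Note that the JVM arguments logged at startup show the heap flags twice (-Xms4g/-Xmx4g and -Xms4098m/-Xmx4098m), which appears to be the result of setting the heap both in the Java options list and in the service manager's memory pool fields.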

Server specifics
Windows 2012 R2
8GB RAM
ES: 6.2.1
LS: 6.2.1
KB: 6.2.1

Any suggestions welcome!


(Christian Dahlqvist) #2

Can you provide us with some statistics from the cluster by running the cluster stats API?
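
For reference, a request along these lines should return the stats shown in the next reply (the host and port are an assumption based on a default local binding):

    curl -XGET "http://localhost:9200/_cluster/stats?human&pretty"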


#3

Now that provides some interesting info...
Why is my heap max higher than the available memory I have in the server?

{
  "_nodes" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "cluster_name" : "gs-elasticsearch",
  "timestamp" : 1518454797203,
  "status" : "yellow",
  "indices" : {
    "count" : 175,
    "shards" : {
      "total" : 871,
      "primaries" : 871,
      "replication" : 0.0,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 5,
          "avg" : 4.977142857142857
        },
        "primaries" : {
          "min" : 1,
          "max" : 5,
          "avg" : 4.977142857142857
        },
        "replication" : {
          "min" : 0.0,
          "max" : 0.0,
          "avg" : 0.0
        }
      }
    },
    "docs" : {
      "count" : 40184660,
      "deleted" : 111
    },
    "store" : {
      "size" : "27.7gb",
      "size_in_bytes" : 29793940379
    },
    "fielddata" : {
      "memory_size" : "5.9mb",
      "memory_size_in_bytes" : 6261448,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "845.5kb",
      "memory_size_in_bytes" : 865889,
      "total_count" : 2320,
      "hit_count" : 420,
      "miss_count" : 1900,
      "cache_size" : 52,
      "cache_count" : 52,
      "evictions" : 0
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 4543,
      "memory" : "127.8mb",
      "memory_in_bytes" : 134092950,
      "terms_memory" : "99.9mb",
      "terms_memory_in_bytes" : 104776149,
      "stored_fields_memory" : "12mb",
      "stored_fields_memory_in_bytes" : 12621136,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "7.9mb",
      "norms_memory_in_bytes" : 8316864,
      "points_memory" : "3.1mb",
      "points_memory_in_bytes" : 3348269,
      "doc_values_memory" : "4.7mb",
      "doc_values_memory_in_bytes" : 5030532,
      "index_writer_memory" : "16.7mb",
      "index_writer_memory_in_bytes" : 17535256,
      "version_map_memory" : "0b",
      "version_map_memory_in_bytes" : 0,
      "fixed_bit_set" : "0b",
      "fixed_bit_set_memory_in_bytes" : 0,
      "max_unsafe_auto_id_timestamp" : 1518452370398,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 3,
      "data" : 3,
      "coordinating_only" : 0,
      "master" : 3,
      "ingest" : 3
    },
    "versions" : [
      "6.2.1"
    ],
    "os" : {
      "available_processors" : 12,
      "allocated_processors" : 12,
      "names" : [
        {
          "name" : "Windows Server 2012 R2",
          "count" : 3
        }
      ],
      "mem" : {
        "total" : "35.9gb",
        "total_in_bytes" : 38652887040,
        "free" : "18gb",
        "free_in_bytes" : 19374075904,
        "used" : "17.9gb",
        "used_in_bytes" : 19278811136,
        "free_percent" : 50,
        "used_percent" : 50
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 48
      },
      "open_file_descriptors" : {
        "min" : -1,
        "max" : -1,
        "avg" : 0
      }
    },
    "jvm" : {
      "max_uptime" : "52.9m",
      "max_uptime_in_millis" : 3176109,
      "versions" : [
        {
          "version" : "1.8.0_144",
          "vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
          "vm_version" : "25.144-b01",
          "vm_vendor" : "Oracle Corporation",
          "count" : 3
        }
      ],
      "mem" : {
        "heap_used" : "1.9gb",
        "heap_used_in_bytes" : 2044156432,
       ** "heap_max" : "11.9gb",**
        "heap_max_in_bytes" : 12786597888
      },
      "threads" : 330
    },
    "fs" : {
      "total" : "299.9gb",
      "total_in_bytes" : 322113097728,
      "free" : "234.4gb",
      "free_in_bytes" : 251707219968,
      "available" : "234.4gb",
      "available_in_bytes" : 251707219968
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "netty4" : 3
      },
      "http_types" : {
        "netty4" : 3
      }
    }
  }
}

(Christian Dahlqvist) #4

You said you had 3 nodes with 4GB heap each, right? That is a total of 12GB. Are these not running on separate hosts?
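
For reference, the figures in the stats output above line up with that (values taken directly from the response):

    heap_max_in_bytes = 12786597888 bytes ≈ 11.9 GB for the whole cluster
    12786597888 / 3 nodes ≈ 4262199296 bytes ≈ 4 GB of heap per node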


#5

I do. My fault, I was thinking per node. I have 3 ES nodes, 1 KB server and 1 LS server. So nothing in there looks "off" to you?

Additional log info from when the nodes stay up for a while (usually when there are fewer logs being thrown at the cluster, I assume):

[2018-02-12T19:17:41,159][INFO ][o.e.m.j.JvmGcMonitorService] [es-srv1] [gc][old][5797][37] duration [8.8s], collections [1]/[9.8s], total [8.8s]/[3.6m], memory [3.4gb]->[2.8gb]/[3.9gb], all_pools {[young] [4.3kb]->[1.9mb]/[266.2mb]}{[survivor] [33.2mb]->[0b]/[33.2mb]}{[old] [3.4gb]->[2.7gb]/[3.6gb]}
[2018-02-12T19:17:41,159][WARN ][o.e.m.j.JvmGcMonitorService] [es-srv1] [gc][5797] overhead, spent [9.2s] collecting in the last [9.8s]
[2018-02-12T19:17:42,174][WARN ][o.e.m.j.JvmGcMonitorService] [es-srv1] [gc][5798] overhead, spent [548ms] collecting in the last [1s]
[2018-02-12T19:17:43,190][WARN ][o.e.m.j.JvmGcMonitorService] [es-srv1] [gc][5799] overhead, spent [557ms] collecting in the last [1s]
[2018-02-12T19:17:44,206][INFO ][o.e.m.j.JvmGcMonitorService] [es-srv1] [gc][5800] overhead, spent [388ms] collecting in the last [1s]
[2018-02-12T19:17:51,065][INFO ][o.e.m.j.JvmGcMonitorService] [es-srv1] [gc][old][5801][38] duration [6.4s], collections [1]/[6.8s], total [6.4s]/[3.7m], memory [3.7gb]->[2.1gb]/[3.9gb], all_pools {[young] [130.8mb]->[4.1mb]/[266.2mb]}{[survivor] [33.2mb]->[0b]/[33.2mb]}{[old] [3.5gb]->[2.1gb]/[3.6gb]}
[2018-02-12T19:17:51,065][WARN ][o.e.m.j.JvmGcMonitorService] [es-srv1] [gc][5801] overhead, spent [6.6s] collecting in the last [6.8s]
[2018-02-12T19:17:52,112][INFO ][o.e.m.j.JvmGcMonitorService] [es-srv1] [gc][5802] overhead, spent [440ms] collecting in the last [1s]

#6

Any ideas? I'm at a point where I can't see any other option but wiping out my whole installation and starting again.


(Christian Dahlqvist) #7

Do you have X-Pack monitoring installed so we can get a picture of how heap usage varies over time?
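
For what it's worth, if monitoring isn't available, heap usage can also be sampled periodically with the cat nodes API; a minimal sketch, assuming the default local binding:

    curl -XGET "http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max"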


#8

Sorry Christian, I've just replied to you regarding another post I'd made along the same lines. To save you having to search it out, here's the reply:

So I've used Cerebro to get a better overview of the issue. I can now watch the JVM heap size on each node slowly increase, sometimes to 90% on one of the nodes. I have 42,087,876 docs and 1,782 shards spread across the 3 nodes, with a total size of 58GB and 4GB of JVM heap per node. Based on the above, can you confirm that the reason I am seeing all these timeouts is the sheer number of shards/docs?

If this is the case, what are the steps to reduce them? I only have 179 indices. Is it a config issue that I screwed up during the initial build?

In terms of heap usage over time: I can watch it climb from 32% to 60% or higher in 5 minutes or less in some instances. Having run a Curator cleanup, that usage has decreased, but it'll be back up again within the next few hours. I've only created 5 indices that push data into this cluster, so I'm slightly confused as to how I'm overloading it!
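
For anyone hitting the same thing: the per-index shard count is fixed when an index is created, usually by an index template, so lowering it only affects newly created indices; existing oversized indices would need the shrink or reindex APIs instead. A minimal 6.x template sketch, with a made-up template name and index pattern and the default local binding assumed:

    curl -XPUT "http://localhost:9200/_template/fewer_shards_example" -H "Content-Type: application/json" -d'
    {
      "index_patterns": ["logstash-*"],
      "order": 1,
      "settings": {
        "index.number_of_shards": 1
      }
    }'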

Thanks.


(system) #9

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.