GC overhead on startup

Hi,

I'm using Elasticsearch 5.6.3 on Docker, with the official Elastic image.
I start the process with -Xms31g and -Xmx31g, and I have enough RAM on the host machine (256 GB). I have 5 TB of data with 1,600,000,000 docs.
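
For reference, here is roughly how the heap gets passed to the container. The docker run invocation below is only a sketch (container name, ports and the rest of the command line are assumptions); the ES_JAVA_OPTS values are the ones from my setup:

# Sketch only: the real run command is not shown in this thread.
# ES_JAVA_OPTS is the documented way to set the heap for the official 5.x image.
docker run -d --name es-node0 \
  -e ES_JAVA_OPTS="-Xms31g -Xmx31g" \
  -p 9200:9200 -p 9300:9300 \
  docker.elastic.co/elasticsearch/elasticsearch:5.6.3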

On cluster startup I'm getting GC overhead warnings even though the cluster is idle (no ongoing searches), and when I run some intensive search requests the GC overhead increases to reach 59 seconds!

Here is an extract of the logs (cluster idle):

[2017-10-16T08:42:30,274][INFO ][o.e.l.LicenseService     ] [Node0] license [a6b18c87-24e5-4289-b94f-ec7b12c6926a] mode [basic] - valid
[2017-10-16T08:42:32,880][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][14] overhead, spent [388ms] collecting in the last [1s]
[2017-10-16T08:42:38,236][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][young][19][3] duration [982ms], collections [1]/[1.3s], total [982ms]/[1.6s], memory [3.3gb]->[2.2gb]/[30.6gb], all_pools {[young] [2.2gb]->[50.4mb]/[2.4gb]}{[survivor] [316.1mb]->[316.1mb]/[316.1mb]}{[old] [875.6mb]->[1.8gb]/[27.9gb]}
[2017-10-16T08:42:38,237][WARN ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][19] overhead, spent [982ms] collecting in the last [1.3s]
[2017-10-16T08:43:17,726][WARN ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][young][58][4] duration [1.3s], collections [1]/[1.4s], total [1.3s]/[2.9s], memory [4.6gb]->[3gb]/[30.6gb], all_pools {[young] [2.4gb]->[22.9mb]/[2.4gb]}{[survivor] [316.1mb]->[316.1mb]/[316.1mb]}{[old] [1.8gb]->[2.6gb]/[27.9gb]}
[2017-10-16T08:43:17,727][WARN ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][58] overhead, spent [1.3s] collecting in the last [1.4s]

Cluster performing intensive search requests:

[2017-10-16T08:17:14,617][WARN ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][258188] overhead, spent [42.3s] collecting in the last [42.4s]
[2017-10-16T08:18:03,941][WARN ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][old][258189][18] duration [49.2s], collections [1]/[49.3s], total [49.2s]/[12.7m], memory [30.6gb]->[30.6gb]/[30.6gb], all_pools {[young] [2.4gb]->[2.4gb]/[2.4gb]}{[survivor] [306.1mb]->[311.1mb]/[316.1mb]}{[old] [27.9gb]->[27.9gb]/[27.9gb]}
[2017-10-16T08:18:03,941][WARN ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][258189] overhead, spent [49.2s] collecting in the last [49.3s]
[2017-10-16T08:18:55,724][WARN ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][old][258190][19] duration [51.7s], collections [1]/[51.7s], total [51.7s]/[13.5m], memory [30.6gb]->[30.6gb]/[30.6gb], all_pools {[young] [2.4gb]->[2.4gb]/[2.4gb]}{[survivor] [311.1mb]->[309mb]/[316.1mb]}{[old] [27.9gb]->[27.9gb]/[27.9gb]}
[2017-10-16T08:18:55,724][WARN ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][258190] overhead, spent [51.7s] collecting in the last [51.7s]
[2017-10-16T08:19:55,350][WARN ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][old][258191][20] duration [59.5s], collections [1]/[59.6s], total [59.5s]/[14.5m], memory [30.6gb]->[30.6gb]/[30.6gb], all_pools {[young] [2.4gb]->[2.4gb]/[2.4gb]}{[survivor] [309mb]->[308.2mb]/[316.1mb]}{[old] [27.9gb]->[27.9gb]/[27.9gb]}
[2017-10-16T08:19:55,351][WARN ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][258191] overhead, spent [59.5s] collecting in the last [59.6s]
[2017-10-16T08:19:55,372][INFO ][o.e.d.z.ZenDiscovery     ] [Node0] master_left [{Node2}{BiklfW9OTk2myZ1vDDwreg}{dQto8egcSDm1AqetpU8NNg}{10.150.232.143}{10.150.232.143:9300}], reason [failed to ping, tried [3] times, each with  maximum [30s] timeout]

Do you have any clue why this is happening?
Thank you.

How many indices and shards do you have?

10 indices and 5 shards per index. Only 3 indices are quite big though; the 7 others are for testing purposes only.

curl -XGET 'localhost:9200/_cat/indices'
green open publication                     Qi2jEOTXS3CCsuPsYXU9dQ 5 0 109241672  7386797   2.2tb   2.2tb
green open family                          1C9eHajeQD2snZTAMY9lvA 5 0 713891178 63638498   2.5tb   2.5tb
green open patent                          NLYUfoxnS8KWb_kW1-5Irg 5 0 741077273 53436097 597.4gb 597.4gb
green open f10family                       FTaBNzMkQ2eNGyJjDAY0Xw 5 0    860664        0   3.2gb   3.2gb
green open extractpatent                   _mJ1vfyYR0iAbqUApNYgXg 5 0      3776        0  14.6mb  14.6mb
green open extractpublication              fMdaefBCSMStSDikUHCTIA 5 0      4326        0   4.9mb   4.9mb
green open f10publication                  M89WU4bySrKDttNMAFxDHw 5 0    131007        0   2.9gb   2.9gb
green open extractfamily                   -m011dJBSj6d8g0Kh7_25g 5 0      3548        0  16.1mb  16.1mb
green open f10patent                       cArV_hQIRVyY4g357pKbiw 5 0    893510        0 756.9mb 756.9mb
green open history                         jcU_bdGrQvGP5yKmi3vJPA 5 0      5090        0   3.4mb   3.4mb

The minimum query latency will depend on the shard size, so the fact that you have shards approaching 500GB in size is quite likely to contribute to poor query performance. Exactly how they are linked depends on the data as well as the query patterns.

I would recommend performing a shard sizing benchmark as described in this Elastic{ON} talk.
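
As a first check, the per-shard sizes can be listed with the cat shards API, sorted by store size (host and port below are assumptions, adjust for your setup):

curl -XGET 'localhost:9200/_cat/shards?v&h=index,shard,prirep,docs,store&s=store:desc'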

Thank you for your answer. I tried to perform the same search benchmark on a small index, roughly 10 GB of data within a single index (5 shards). I also closed the biggest indices beforehand, and I am still getting GC overhead warnings when performing search requests. 3 × 31 GB of RAM per node.

[2017-10-16T13:29:06,139][WARN ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][12] overhead, spent [761ms] collecting in the last [1.4s]
[2017-10-16T13:29:10,142][WARN ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][16] overhead, spent [504ms] collecting in the last [1s]
[2017-10-16T13:33:29,229][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][275] overhead, spent [471ms] collecting in the last [1s]
[2017-10-16T13:33:38,308][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][284] overhead, spent [299ms] collecting in the last [1s]
[2017-10-16T13:33:46,623][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][292] overhead, spent [366ms] collecting in the last [1.3s]
[2017-10-16T13:33:58,901][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][304] overhead, spent [374ms] collecting in the last [1.2s]
[2017-10-16T13:34:09,904][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][315] overhead, spent [348ms] collecting in the last [1s]
[2017-10-16T13:34:19,256][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][324] overhead, spent [552ms] collecting in the last [1.3s]
[2017-10-16T13:34:28,313][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][333] overhead, spent [443ms] collecting in the last [1s]
[2017-10-16T13:34:42,801][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][347] overhead, spent [681ms] collecting in the last [1.4s]
[2017-10-16T13:34:54,258][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][358] overhead, spent [484ms] collecting in the last [1.4s]
[2017-10-16T13:35:03,261][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][367] overhead, spent [355ms] collecting in the last [1s]
[2017-10-16T13:35:11,614][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][375] overhead, spent [529ms] collecting in the last [1.3s]
[2017-10-16T13:35:22,617][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][386] overhead, spent [446ms] collecting in the last [1s]
[2017-10-16T13:35:34,620][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][398] overhead, spent [354ms] collecting in the last [1s]
[2017-10-16T13:36:07,629][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][431] overhead, spent [287ms] collecting in the last [1s]
[2017-10-16T13:43:13,757][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][857] overhead, spent [328ms] collecting in the last [1s]

And while the request script is stuck waiting for a response from the cluster, I get this GC log output followed by an OutOfMemoryError:

[Full GC (Ergonomics) [PSYoungGen: 3611647K->3611647K(7223296K)] [ParOldGen: 21670791K->21670791K(21670912K)] 25282439K->25282439K(28894208K), [Metaspace: 74312K->74312K(1120256K)], 3.1321511 secs] [Times: user=109.47 sys=1.38, real=3.13 secs]
[Full GC (Ergonomics) [PSYoungGen: 3611648K->3611647K(7223296K)] [ParOldGen: 21670804K->21670804K(21670912K)] 25282452K->25282452K(28894208K), [Metaspace: 74312K->74312K(1120256K)], 3.0231559 secs] [Times: user=105.37 sys=1.02, real=3.02 secs]
[Full GC (Ergonomics) [PSYoungGen: 3611647K->3611647K(7223296K)] [ParOldGen: 21670804K->21670804K(21670912K)] 25282452K->25282452K(28894208K), [Metaspace: 74312K->74312K(1120256K)], 3.0955970 secs] [Times: user=104.38 sys=4.30, real=3.09 secs]
[Full GC (Ergonomics) [PSYoungGen: 3611648K->3603844K(7223296K)] [ParOldGen: 21670804K->21670454K(21670912K)] 25282452K->25274299K(28894208K), [Metaspace: 74312K->74217K(1120256K)], 4.6993845 secs] [Times: user=161.42 sys=1.39, real=4.70 secs]
java.lang.OutOfMemoryError: GC overhead limit exceeded
Dumping heap to /usr/share/elasticsearch/data/logs/java_pid1.hprof ...
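
Side note: while the script hangs, per-node heap pressure can also be polled with the nodes stats API (localhost:9200 here is an assumption for whichever node still responds):

curl -XGET 'localhost:9200/_nodes/stats/jvm?human&pretty'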

Can you provide the output of the cluster stats API? This will show statistics, e.g. heap and memory usage.
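
For example (adjust host and port for your setup):

curl -XGET 'localhost:9200/_cluster/stats?human&pretty'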

Thank you for your interest, here is the "jvm" part from the cluster stats API:

"jvm" : {
      "max_uptime" : "3.9d",
      "max_uptime_in_millis" : 338899113,
      "versions" : [
        {
          "version" : "1.8.0_141",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "25.141-b16",
          "vm_vendor" : "Oracle Corporation",
          "count" : 3
        },
        {
          "version" : "1.8.0_141",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "25.141-b15",
          "vm_vendor" : "Oracle Corporation",
          "count" : 1
        }
      ],
      "mem" : {
        "heap_used" : "24.5gb",
        "heap_used_in_bytes" : 26339364456,
        "heap_max" : "118.8gb",
        "heap_max_in_bytes" : 127665176576
      },
      "threads" : 1034
    }
