GC overhead on startup

Hi,

I'm using Elasticsearch 5.6.3 on Docker, with the official Elastic image.
I start the process with -Xms31g and -Xmx31g, and I have enough RAM on the host machine (256 GB). I have 5 TB of data with 1,600,000,000 docs.
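
For reference, here is roughly how the heap gets passed to the container. The docker run invocation below is only a sketch (container name, ports and the rest of the command line are assumptions); the ES_JAVA_OPTS values are the ones from my setup:

# Sketch only: the real run command is not shown in this thread.
# ES_JAVA_OPTS is the documented way to set the heap for the official 5.x image.
docker run -d --name es-node0 \
  -e ES_JAVA_OPTS="-Xms31g -Xmx31g" \
  -p 9200:9200 -p 9300:9300 \
  docker.elastic.co/elasticsearch/elasticsearch:5.6.3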

On cluster startup I'm getting GC overhead warnings even though the cluster is idle (no ongoing searches), and when I run some intensive search requests the GC overhead increases to reach 59 seconds!

Here is an extract of the logs (cluster idle):

[2017-10-16T08:42:30,274][INFO ][o.e.l.LicenseService     ] [Node0] license [a6b18c87-24e5-4289-b94f-ec7b12c6926a] mode [basic] - valid
[2017-10-16T08:42:32,880][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][14] overhead, spent [388ms] collecting in the last [1s]
[2017-10-16T08:42:38,236][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][young][19][3] duration [982ms], collections [1]/[1.3s], total [982ms]/[1.6s], memory [3.3gb]->[2.2gb]/[30.6gb], all_pools {[young] [2.2gb]->[50.4mb]/[2.4gb]}{[survivor] [316.1mb]->[316.1mb]/[316.1mb]}{[old] [875.6mb]->[1.8gb]/[27.9gb]}
[2017-10-16T08:42:38,237][WARN ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][19] overhead, spent [982ms] collecting in the last [1.3s]
[2017-10-16T08:43:17,726][WARN ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][young][58][4] duration [1.3s], collections [1]/[1.4s], total [1.3s]/[2.9s], memory [4.6gb]->[3gb]/[30.6gb], all_pools {[young] [2.4gb]->[22.9mb]/[2.4gb]}{[survivor] [316.1mb]->[316.1mb]/[316.1mb]}{[old] [1.8gb]->[2.6gb]/[27.9gb]}
[2017-10-16T08:43:17,727][WARN ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][58] overhead, spent [1.3s] collecting in the last [1.4s]

Cluster performing intensive search requests:

[2017-10-16T08:17:14,617][WARN ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][258188] overhead, spent [42.3s] collecting in the last [42.4s]
[2017-10-16T08:18:03,941][WARN ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][old][258189][18] duration [49.2s], collections [1]/[49.3s], total [49.2s]/[12.7m], memory [30.6gb]->[30.6gb]/[30.6gb], all_pools {[young] [2.4gb]->[2.4gb]/[2.4gb]}{[survivor] [306.1mb]->[311.1mb]/[316.1mb]}{[old] [27.9gb]->[27.9gb]/[27.9gb]}
[2017-10-16T08:18:03,941][WARN ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][258189] overhead, spent [49.2s] collecting in the last [49.3s]
[2017-10-16T08:18:55,724][WARN ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][old][258190][19] duration [51.7s], collections [1]/[51.7s], total [51.7s]/[13.5m], memory [30.6gb]->[30.6gb]/[30.6gb], all_pools {[young] [2.4gb]->[2.4gb]/[2.4gb]}{[survivor] [311.1mb]->[309mb]/[316.1mb]}{[old] [27.9gb]->[27.9gb]/[27.9gb]}
[2017-10-16T08:18:55,724][WARN ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][258190] overhead, spent [51.7s] collecting in the last [51.7s]
[2017-10-16T08:19:55,350][WARN ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][old][258191][20] duration [59.5s], collections [1]/[59.6s], total [59.5s]/[14.5m], memory [30.6gb]->[30.6gb]/[30.6gb], all_pools {[young] [2.4gb]->[2.4gb]/[2.4gb]}{[survivor] [309mb]->[308.2mb]/[316.1mb]}{[old] [27.9gb]->[27.9gb]/[27.9gb]}
[2017-10-16T08:19:55,351][WARN ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][258191] overhead, spent [59.5s] collecting in the last [59.6s]
[2017-10-16T08:19:55,372][INFO ][o.e.d.z.ZenDiscovery     ] [Node0] master_left [{Node2}{BiklfW9OTk2myZ1vDDwreg}{dQto8egcSDm1AqetpU8NNg}{10.150.232.143}{10.150.232.143:9300}], reason [failed to ping, tried [3] times, each with  maximum [30s] timeout]

Do you have any clue why this is happening?
Thank you.

How many indices and shards do you have?

10 indices and 5 shards per index. Only 3 indices are quite big though; the 7 others are for testing purposes only.

curl -XGET 'localhost:9200/_cat/indices'
green open publication                     Qi2jEOTXS3CCsuPsYXU9dQ 5 0 109241672  7386797   2.2tb   2.2tb
green open family                          1C9eHajeQD2snZTAMY9lvA 5 0 713891178 63638498   2.5tb   2.5tb
green open patent                          NLYUfoxnS8KWb_kW1-5Irg 5 0 741077273 53436097 597.4gb 597.4gb
green open f10family                       FTaBNzMkQ2eNGyJjDAY0Xw 5 0    860664        0   3.2gb   3.2gb
green open extractpatent                   _mJ1vfyYR0iAbqUApNYgXg 5 0      3776        0  14.6mb  14.6mb
green open extractpublication              fMdaefBCSMStSDikUHCTIA 5 0      4326        0   4.9mb   4.9mb
green open f10publication                  M89WU4bySrKDttNMAFxDHw 5 0    131007        0   2.9gb   2.9gb
green open extractfamily                   -m011dJBSj6d8g0Kh7_25g 5 0      3548        0  16.1mb  16.1mb
green open f10patent                       cArV_hQIRVyY4g357pKbiw 5 0    893510        0 756.9mb 756.9mb
green open history                         jcU_bdGrQvGP5yKmi3vJPA 5 0      5090        0   3.4mb   3.4mb

The minimum query latency will depend on the shard size, so the fact that you have shards approaching 500GB in size is quite likely to contribute to poor query performance. Exactly how they are linked depends on the data as well as the query patterns.

I would recommend performing a shard sizing benchmark as described in this Elastic{ON} talk.
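
As a first check, the per-shard sizes can be listed with the cat shards API, sorted by store size (host and port below are assumptions, adjust for your setup):

curl -XGET 'localhost:9200/_cat/shards?v&h=index,shard,prirep,docs,store&s=store:desc'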

Thank you for your answer. I tried to perform the same search benchmark on a small index, roughly 10 GB of data within a single index (5 shards). I also closed the biggest indices beforehand, and I am still getting GC overhead warnings when performing search requests. 3 × 31 GB of RAM per node.

[2017-10-16T13:29:06,139][WARN ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][12] overhead, spent [761ms] collecting in the last [1.4s]
[2017-10-16T13:29:10,142][WARN ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][16] overhead, spent [504ms] collecting in the last [1s]
[2017-10-16T13:33:29,229][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][275] overhead, spent [471ms] collecting in the last [1s]
[2017-10-16T13:33:38,308][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][284] overhead, spent [299ms] collecting in the last [1s]
[2017-10-16T13:33:46,623][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][292] overhead, spent [366ms] collecting in the last [1.3s]
[2017-10-16T13:33:58,901][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][304] overhead, spent [374ms] collecting in the last [1.2s]
[2017-10-16T13:34:09,904][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][315] overhead, spent [348ms] collecting in the last [1s]
[2017-10-16T13:34:19,256][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][324] overhead, spent [552ms] collecting in the last [1.3s]
[2017-10-16T13:34:28,313][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][333] overhead, spent [443ms] collecting in the last [1s]
[2017-10-16T13:34:42,801][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][347] overhead, spent [681ms] collecting in the last [1.4s]
[2017-10-16T13:34:54,258][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][358] overhead, spent [484ms] collecting in the last [1.4s]
[2017-10-16T13:35:03,261][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][367] overhead, spent [355ms] collecting in the last [1s]
[2017-10-16T13:35:11,614][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][375] overhead, spent [529ms] collecting in the last [1.3s]
[2017-10-16T13:35:22,617][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][386] overhead, spent [446ms] collecting in the last [1s]
[2017-10-16T13:35:34,620][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][398] overhead, spent [354ms] collecting in the last [1s]
[2017-10-16T13:36:07,629][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][431] overhead, spent [287ms] collecting in the last [1s]
[2017-10-16T13:43:13,757][INFO ][o.e.m.j.JvmGcMonitorService] [Node0] [gc][857] overhead, spent [328ms] collecting in the last [1s]

And while the request script is stuck waiting for a response from the cluster, I get this GC log output followed by an OutOfMemoryError:

[Full GC (Ergonomics) [PSYoungGen: 3611647K->3611647K(7223296K)] [ParOldGen: 21670791K->21670791K(21670912K)] 25282439K->25282439K(28894208K), [Metaspace: 74312K->74312K(1120256K)], 3.1321511 secs] [Times: user=109.47 sys=1.38, real=3.13 secs]
[Full GC (Ergonomics) [PSYoungGen: 3611648K->3611647K(7223296K)] [ParOldGen: 21670804K->21670804K(21670912K)] 25282452K->25282452K(28894208K), [Metaspace: 74312K->74312K(1120256K)], 3.0231559 secs] [Times: user=105.37 sys=1.02, real=3.02 secs]
[Full GC (Ergonomics) [PSYoungGen: 3611647K->3611647K(7223296K)] [ParOldGen: 21670804K->21670804K(21670912K)] 25282452K->25282452K(28894208K), [Metaspace: 74312K->74312K(1120256K)], 3.0955970 secs] [Times: user=104.38 sys=4.30, real=3.09 secs]
[Full GC (Ergonomics) [PSYoungGen: 3611648K->3603844K(7223296K)] [ParOldGen: 21670804K->21670454K(21670912K)] 25282452K->25274299K(28894208K), [Metaspace: 74312K->74217K(1120256K)], 4.6993845 secs] [Times: user=161.42 sys=1.39, real=4.70 secs]
java.lang.OutOfMemoryError: GC overhead limit exceeded
Dumping heap to /usr/share/elasticsearch/data/logs/java_pid1.hprof ...
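
Side note: while the script hangs, per-node heap pressure can also be polled with the nodes stats API (localhost:9200 here is an assumption for whichever node still responds):

curl -XGET 'localhost:9200/_nodes/stats/jvm?human&pretty'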

Can you provide the output of the cluster stats API? This will show statistics, e.g. heap and memory usage.
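
For example (adjust host and port for your setup):

curl -XGET 'localhost:9200/_cluster/stats?human&pretty'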

Thank you for your interest, here is the "jvm" part from the cluster stats API:

"jvm" : {
      "max_uptime" : "3.9d",
      "max_uptime_in_millis" : 338899113,
      "versions" : [
        {
          "version" : "1.8.0_141",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "25.141-b16",
          "vm_vendor" : "Oracle Corporation",
          "count" : 3
        },
        {
          "version" : "1.8.0_141",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "25.141-b15",
          "vm_vendor" : "Oracle Corporation",
          "count" : 1
        }
      ],
      "mem" : {
        "heap_used" : "24.5gb",
        "heap_used_in_bytes" : 26339364456,
        "heap_max" : "118.8gb",
        "heap_max_in_bytes" : 127665176576
      },
      "threads" : 1034
    }
