Strange young GC pauses with ES 1.3 and Java 8

I'm seeing a strange pattern in my cluster. Every thirty minutes I see a spike in response times. Matching the times to the GC logs, the long responses correlate with long young GCs. Often there will be several GCs in a row:

[2015-07-30 16:05:34,767][INFO ][monitor.jvm              ] [bazaar-polloi-agrippa-ironmand-elasticsearch-31a7a4e2] [gc][young][574478][514018] duration [904ms], collections [1]/[1.3s], total [904ms]/[4.6h], memory [17.8gb]->[18.2gb]/[28gb], all_pools {[young] [221.9kb]->[7.3mb]/[1.4gb]}{[survivor] [127.7mb]->[191.3mb]/[191.3mb]}{[old] [17.6gb]->[18gb]/[26.3gb]}
[2015-07-30 16:05:35,787][INFO ][monitor.jvm              ] [bazaar-polloi-agrippa-ironmand-elasticsearch-31a7a4e2] [gc][young][574479][514019] duration [891ms], collections [1]/[1s], total [891ms]/[4.6h], memory [18.2gb]->[18.6gb]/[28gb], all_pools {[young] [7.3mb]->[88.9mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [18gb]->[18.3gb]/[26.3gb]}
[2015-07-30 16:05:39,494][WARN ][monitor.jvm              ] [bazaar-polloi-agrippa-ironmand-elasticsearch-31a7a4e2] [gc][young][574482][514024] duration [1.5s], collections [1]/[1.7s], total [1.5s]/[4.6h], memory [19.3gb]->[19.6gb]/[28gb], all_pools {[young] [313.1mb]->[16.6mb]/[1.4gb]}{[survivor] [87.1mb]->[191.3mb]/[191.3mb]}{[old] [18.9gb]->[19.4gb]/[26.3gb]}
[2015-07-30 16:05:40,495][INFO ][monitor.jvm              ] [bazaar-polloi-agrippa-ironmand-elasticsearch-31a7a4e2] [gc][young][574483][514025] duration [804ms], collections [1]/[1s], total [804ms]/[4.6h], memory [19.6gb]->[20.4gb]/[28gb], all_pools {[young] [16.6mb]->[705.7mb]/[1.4gb]}{[survivor] [191.3mb]->[99.3mb]/[191.3mb]}{[old] [19.4gb]->[19.6gb]/[26.3gb]}
[2015-07-30 16:05:45,106][WARN ][monitor.jvm              ] [bazaar-polloi-agrippa-ironmand-elasticsearch-31a7a4e2] [gc][young][574486][514032] duration [2.1s], collections [1]/[2.6s], total [2.1s]/[4.6h], memory [19.8gb]->[20.5gb]/[28gb], all_pools {[young] [140.8mb]->[3.6mb]/[1.4gb]}{[survivor] [80mb]->[191.3mb]/[191.3mb]}{[old] [19.6gb]->[20.3gb]/[26.3gb]}
[2015-07-30 16:05:46,106][INFO ][monitor.jvm              ] [bazaar-polloi-agrippa-ironmand-elasticsearch-31a7a4e2] [gc][young][574487][514033] duration [763ms], collections [1]/[1s], total [763ms]/[4.6h], memory [20.5gb]->[21.6gb]/[28gb], all_pools {[young] [3.6mb]->[950.7mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [20.3gb]->[20.5gb]/[26.3gb]}
[2015-07-30 16:35:37,196][WARN ][monitor.jvm              ] [bazaar-polloi-agrippa-ironmand-elasticsearch-31a7a4e2] [gc][young][576274][516618] duration [1.7s], collections [1]/[2s], total [1.7s]/[4.6h], memory [12.6gb]->[13.4gb]/[28gb], all_pools {[young] [813.5kb]->[18.9mb]/[1.4gb]}{[survivor] [107.6mb]->[191.3mb]/[191.3mb]}{[old] [12.5gb]->[13.2gb]/[26.3gb]}

My cluster has 3 master nodes, 5 client nodes, and 27 data nodes. The data nodes each have a 28 GB heap. It's running Elasticsearch 1.3 on OpenJDK 1.8.
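
For reference, here's how I've been double-checking the generation sizing the JVM actually ended up with on a data node. This is just a rough sketch: it assumes jcmd from the same JDK is on the path and that pgrep finds only the one Elasticsearch process on the box.

# find the Elasticsearch PID, then dump the effective VM flags (NewSize, SurvivorRatio, CMS settings, etc.)
ES_PID=$(pgrep -f elasticsearch | head -n 1)
jcmd "$ES_PID" VM.flags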

Here's what confuses me:

  • Why would a young GC take 1s or more?
  • Why would there be so many GCs in a row?
  • Is there anything internal to ES that might cause this on such a regular basis? I stopped indexing for an hour and still saw the spikes. But our queries are so irregular that I can't imagine any pattern there causing this. It happens every thirty minutes, at 5 and 35 past the hour, and has done so for weeks. (One idea for catching whatever the trigger is in the act is sketched right after this list.)
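
One idea for catching the trigger in the act: capture hot threads on all nodes right around :05 and :35 past the hour (from cron, say). A sketch, assuming the cluster is reachable on localhost:9200; the output path is just an example:

# snapshot the busiest threads on every node around the time of the spike
curl -s 'localhost:9200/_nodes/hot_threads?threads=5&interval=500ms' > /tmp/hot_threads_$(date +%H%M%S).txt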

What sort of monitoring do you have in place, e.g. Marvel?
Having a broader view of things can help correlate GCs with other operations happening in the cluster.

We don't have Marvel, but we do have Datadog monitoring, so we get the metrics from the Datadog Elasticsearch Integration. I've tried to correlate them with the GC spikes and found nothing interesting. Most of my ideas involved new indexing, and I ruled those out by stopping indexing.
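
For anyone else looking at this, the raw GC counters that ES itself exposes are also easy to poll next to the Datadog graphs, which is roughly how I've been trying to line things up. A sketch, assuming a node is reachable on localhost:9200:

# per-node JVM stats: heap pool sizes plus young/old collection counts and accumulated time
curl -s 'localhost:9200/_nodes/stats/jvm?pretty'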

From running jstat -gccause in a loop, I see that Eden space goes to 100% used, then within 100ms the empty survivor space jumps to 100%. It stays like that for about a second, then the GC ends. Sometimes it repeats immediately.
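
Concretely, the loop is roughly this (sampling every 250 ms; <pid> is the data node's Elasticsearch process ID):

# S0/S1 and E are survivor and Eden utilization (%), YGC/YGCT are young collection count and time,
# LGCC is the cause of the last collection
jstat -gccause <pid> 250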

I believe I've read that the duration of a young GC depends mostly on the amount of live data that has to be copied out of Eden. So this suggests to me that ES is creating a lot of objects all at once that are still live when the collection runs, probably more than can fit in the survivor space, in which case the overflow gets promoted straight to old gen (the logs above do show old gen growing by several hundred MB per collection). That would explain both the long GCs and the repeated GCs.

But it's still not clear what the trigger could be.
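
One thing that might confirm (or rule out) the survivor-overflow theory is detailed GC logging on one of the data nodes. A sketch, assuming HotSpot on Java 8 and that extra flags can be passed in via ES_JAVA_OPTS; the log path is just an example:

# PrintTenuringDistribution shows how much data survives each young GC and at what age,
# which makes premature promotion into old gen easy to spot
export ES_JAVA_OPTS="-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -Xloggc:/var/log/elasticsearch/gc.log"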

We are seeing the same behavior: multiple young GCs happening in a row, taking around 700-800ms on some of the nodes and ~1s on a few others.
Why do these young GCs take so long? They definitely cause a spike in response times.
Any help would be appreciated.

2016-07-20T16:30:49.552Z INFO [es-client-16][scheduler][T#1]] [monitor.jvm ]: [es-client-16] [gc][young][70474][2604] duration [720ms], collections [1]/[1s], total [720ms]/[10.7m], memory [5.5gb]->[5.8gb]/[11.9gb], all_pools {[young] [194.7mb]->[251.5mb]/[266.2mb]}{[survivor] [33.2mb]->[33.2mb]/[33.2mb]}{[old] [5.3gb]->[5.5gb]/[11.6gb]}
2016-07-20T16:30:50.552Z INFO [es-client-16][scheduler][T#1]] [monitor.jvm ]: [es-client-16] [gc][young][70475][2605] duration [721ms], collections [1]/[1s], total [721ms]/[10.8m], memory [5.8gb]->[5.9gb]/[11.9gb], all_pools {[young] [251.5mb]->[148.8mb]/[266.2mb]}{[survivor] [33.2mb]->[33.2mb]/[33.2mb]}{[old] [5.5gb]->[5.7gb]/[11.6gb]}
2016-07-20T16:50:47.720Z INFO [es-client-16][scheduler][T#1]] [monitor.jvm ]: [es-client-16] [gc][young][71670][2650] duration [743ms], collections [1]/[1s], total [743ms]/[10.9m], memory [8.5gb]->[8.8gb]/[11.9gb], all_pools {[young] [75.7mb]->[178.5mb]/[266.2mb]}{[survivor] [1.4mb]->[33.2mb]/[33.2mb]}{[old] [8.4gb]->[8.6gb]/[11.6gb]}
2016-07-20T17:45:34.982Z INFO [es-client-16][scheduler][T#1]] [monitor.jvm ]: [es-client-16] [gc][young][74955][2741] duration [769ms], collections [1]/[1s], total [769ms]/[11.1m], memory [3.1gb]->[3.3gb]/[11.9gb], all_pools {[young] [118.1mb]->[91mb]/[266.2mb]}{[survivor] [33.2mb]->[33.2mb]/[33.2mb]}{[old] [3gb]->[3.2gb]/[11.6gb]}
2016-07-20T17:45:35.983Z INFO [es-client-16][scheduler][T#1]] [monitor.jvm ]: [es-client-16] [gc][young][74956][2742] duration [759ms], collections [1]/[1s], total [759ms]/[11.1m], memory [3.3gb]->[3.5gb]/[11.9gb], all_pools {[young] [91mb]->[48.6mb]/[266.2mb]}{[survivor] [33.2mb]->[33.2mb]/[33.2mb]}{[old] [3.2gb]->[3.4gb]/[11.6gb]}
2016-07-20T17:45:36.986Z INFO [es-client-16][scheduler][T#1]] [monitor.jvm ]: [es-client-16] [gc][young][74957][2743] duration [711ms], collections [1]/[1s], total [711ms]/[11.1m], memory [3.5gb]->[3.7gb]/[11.9gb], all_pools {[young] [48.6mb]->[61.8mb]/[266.2mb]}{[survivor] [33.2mb]->[33.2mb]/[33.2mb]}{[old] [3.4gb]->[3.6gb]/[11.6gb]}
2016-07-20T17:45:37.986Z INFO [es-client-16][scheduler][T#1]] [monitor.jvm ]: [es-client-16] [gc][young][74958][2744] duration [714ms], collections [1]/[1s], total [714ms]/[11.1m], memory [3.7gb]->[3.9gb]/[11.9gb], all_pools {[young] [61.8mb]->[37.1mb]/[266.2mb]}{[survivor] [33.2mb]->[33.2mb]/[33.2mb]}{[old] [3.6gb]->[3.8gb]/[11.6gb]}

[1103] duration [1.3s], collections [1]/[1.5s], total [1.3s]/[53.1s], memory [222.2mb]->[326.5mb]/[11.9gb], all_pools {[young] [96.1mb]->[414kb]/[266.2mb]}{[survivor] [807.1kb]->[33.2mb]/[33.2mb]}{[old] [125.3mb]->[292.9mb]/[11.6gb]}
2016-07-20T12:55:16.819Z WARN [es-client-20][scheduler][T#1]] [monitor.jvm ]: [es-client-20] [gc][young][57706][1130] duration [1.2s], collections [1]/[2.1s], total [1.2s]/[57.5s], memory [1gb]->[1.2gb]/[11.9gb], all_pools {[young] [33.7mb]->[2.1mb]/[266.2mb]}{[survivor] [1.2mb]->[33.2mb]/[33.2mb]}{[old] [1gb]->[1.1gb]/[11.6gb]}
2016-07-20T14:55:43.505Z WARN [es-client-20][scheduler][T#1]] [monitor.jvm ]: [es-client-20] [gc][young][64930][1281] duration [1.2s], collections [1]/[1.2s], total [1.2s]/[1.1m], memory [2.6gb]->[2.5gb]/[11.9gb], all_pools {[young] [241.6mb]->[473.6kb]/[266.2mb]}{[survivor] [29.2mb]->[33.2mb]/[33.2mb]}{[old] [2.3gb]->[2.5gb]/[11.6gb]}
2016-07-20T15:15:18.413Z INFO [es-client-20][scheduler][T#1]] [monitor.jvm ]: [es-client-20] [gc][young][66103][1312] duration [1.5s], collections [2]/[2.4s], total [1.5s]/[1.2m], memory [4.4gb]->[4.3gb]/[11.9gb], all_pools {[young] [261.9mb]->[6.5mb]/[266.2mb]}{[survivor] [1.7mb]->[33.2mb]/[33.2mb]}{[old] [4.1gb]->[4.3gb]/[11.6gb]}