Strange young GC pauses with ES 1.3 and Java 8

I'm seeing a strange pattern in my cluster. Every thirty minutes I see a spike in response times. Matching the times to the GC logs, the long responses correlate with long young GCs. Often there will be several GCs in a row:

[2015-07-30 16:05:34,767][INFO ][monitor.jvm              ] [bazaar-polloi-agrippa-ironmand-elasticsearch-31a7a4e2] [gc][young][574478][514018] duration [904ms], collections [1]/[1.3s], total [904ms]/[4.6h], memory [17.8gb]->[18.2gb]/[28gb], all_pools {[young] [221.9kb]->[7.3mb]/[1.4gb]}{[survivor] [127.7mb]->[191.3mb]/[191.3mb]}{[old] [17.6gb]->[18gb]/[26.3gb]}
[2015-07-30 16:05:35,787][INFO ][monitor.jvm              ] [bazaar-polloi-agrippa-ironmand-elasticsearch-31a7a4e2] [gc][young][574479][514019] duration [891ms], collections [1]/[1s], total [891ms]/[4.6h], memory [18.2gb]->[18.6gb]/[28gb], all_pools {[young] [7.3mb]->[88.9mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [18gb]->[18.3gb]/[26.3gb]}
[2015-07-30 16:05:39,494][WARN ][monitor.jvm              ] [bazaar-polloi-agrippa-ironmand-elasticsearch-31a7a4e2] [gc][young][574482][514024] duration [1.5s], collections [1]/[1.7s], total [1.5s]/[4.6h], memory [19.3gb]->[19.6gb]/[28gb], all_pools {[young] [313.1mb]->[16.6mb]/[1.4gb]}{[survivor] [87.1mb]->[191.3mb]/[191.3mb]}{[old] [18.9gb]->[19.4gb]/[26.3gb]}
[2015-07-30 16:05:40,495][INFO ][monitor.jvm              ] [bazaar-polloi-agrippa-ironmand-elasticsearch-31a7a4e2] [gc][young][574483][514025] duration [804ms], collections [1]/[1s], total [804ms]/[4.6h], memory [19.6gb]->[20.4gb]/[28gb], all_pools {[young] [16.6mb]->[705.7mb]/[1.4gb]}{[survivor] [191.3mb]->[99.3mb]/[191.3mb]}{[old] [19.4gb]->[19.6gb]/[26.3gb]}
[2015-07-30 16:05:45,106][WARN ][monitor.jvm              ] [bazaar-polloi-agrippa-ironmand-elasticsearch-31a7a4e2] [gc][young][574486][514032] duration [2.1s], collections [1]/[2.6s], total [2.1s]/[4.6h], memory [19.8gb]->[20.5gb]/[28gb], all_pools {[young] [140.8mb]->[3.6mb]/[1.4gb]}{[survivor] [80mb]->[191.3mb]/[191.3mb]}{[old] [19.6gb]->[20.3gb]/[26.3gb]}
[2015-07-30 16:05:46,106][INFO ][monitor.jvm              ] [bazaar-polloi-agrippa-ironmand-elasticsearch-31a7a4e2] [gc][young][574487][514033] duration [763ms], collections [1]/[1s], total [763ms]/[4.6h], memory [20.5gb]->[21.6gb]/[28gb], all_pools {[young] [3.6mb]->[950.7mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [20.3gb]->[20.5gb]/[26.3gb]}
[2015-07-30 16:35:37,196][WARN ][monitor.jvm              ] [bazaar-polloi-agrippa-ironmand-elasticsearch-31a7a4e2] [gc][young][576274][516618] duration [1.7s], collections [1]/[2s], total [1.7s]/[4.6h], memory [12.6gb]->[13.4gb]/[28gb], all_pools {[young] [813.5kb]->[18.9mb]/[1.4gb]}{[survivor] [107.6mb]->[191.3mb]/[191.3mb]}{[old] [12.5gb]->[13.2gb]/[26.3gb]}

My cluster has 3 master nodes, 5 client nodes, and 27 data nodes. The data nodes each have a 28 GB heap. It's running Elasticsearch 1.3 on OpenJDK 1.8.
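
For reference, here's how I've been double-checking the generation sizing the JVM actually ended up with on a data node. This is just a rough sketch: it assumes jcmd from the same JDK is on the path and that pgrep finds only the one Elasticsearch process on the box.

# find the Elasticsearch PID, then dump the effective VM flags (NewSize, SurvivorRatio, CMS settings, etc.)
ES_PID=$(pgrep -f elasticsearch | head -n 1)
jcmd "$ES_PID" VM.flags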

Here's what confuses me:

  • Why would a young GC take 1s or more?
  • Why would there be so many GCs in a row?
  • Is there anything internal to ES that might cause this on such a regular basis? I stopped indexing for an hour and still saw the spikes. But our queries are so irregular that I can't imagine any pattern there causing this. It happens every thirty minutes, at 5 and 35 past the hour, and has done so for weeks. (One idea for catching whatever the trigger is in the act is sketched right after this list.)
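
One idea for catching the trigger in the act: capture hot threads on all nodes right around :05 and :35 past the hour (from cron, say). A sketch, assuming the cluster is reachable on localhost:9200; the output path is just an example:

# snapshot the busiest threads on every node around the time of the spike
curl -s 'localhost:9200/_nodes/hot_threads?threads=5&interval=500ms' > /tmp/hot_threads_$(date +%H%M%S).txt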

What sort of monitoring do you have in place, e.g. Marvel?
Having a broader view of things can help correlate GCs with other operations happening in the cluster.

We don't have Marvel, but we do have Datadog monitoring, so we get the metrics from the Datadog Elasticsearch Integration. I've tried to correlate them with the GC spikes and found nothing interesting. Most of my ideas involved new indexing, and I ruled those out by stopping indexing.
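
For anyone else looking at this, the raw GC counters that ES itself exposes are also easy to poll next to the Datadog graphs, which is roughly how I've been trying to line things up. A sketch, assuming a node is reachable on localhost:9200:

# per-node JVM stats: heap pool sizes plus young/old collection counts and accumulated time
curl -s 'localhost:9200/_nodes/stats/jvm?pretty'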

From running jstat -gccause in a loop, I see that Eden space goes to 100% used, then within 100ms the empty survivor space jumps to 100%. It stays like that for about a second, then the GC ends. Sometimes it repeats immediately.
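
Concretely, the loop is roughly this (sampling every 250 ms; <pid> is the data node's Elasticsearch process ID):

# S0/S1 and E are survivor and Eden utilization (%), YGC/YGCT are young collection count and time,
# LGCC is the cause of the last collection
jstat -gccause <pid> 250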

I believe I've read that the duration of a young GC depends mostly on the amount of live data that has to be copied out of Eden. So this suggests to me that ES is creating a lot of objects all at once that are still live when the collection runs, probably more than can fit in the survivor space, in which case the overflow gets promoted straight to old gen (the logs above do show old gen growing by several hundred MB per collection). That would explain both the long GCs and the repeated GCs.

But it's still not clear what the trigger could be.
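
One thing that might confirm (or rule out) the survivor-overflow theory is detailed GC logging on one of the data nodes. A sketch, assuming HotSpot on Java 8 and that extra flags can be passed in via ES_JAVA_OPTS; the log path is just an example:

# PrintTenuringDistribution shows how much data survives each young GC and at what age,
# which makes premature promotion into old gen easy to spot
export ES_JAVA_OPTS="-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -Xloggc:/var/log/elasticsearch/gc.log"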

We are seeing the same behavior: multiple young GCs happening in a row, taking around 700-800ms on some of the nodes and ~1s on a few others.
Why do these young GCs take so long? They definitely cause a spike in response times.
Any help would be appreciated.

2016-07-20T16:30:49.552Z INFO [es-client-16][scheduler][T#1]] [monitor.jvm ]: [es-client-16] [gc][young][70474][2604] duration [720ms], collections [1]/[1s], total [720ms]/[10.7m], memory [5.5gb]->[5.8gb]/[11.9gb], all_pools {[young] [194.7mb]->[251.5mb]/[266.2mb]}{[survivor] [33.2mb]->[33.2mb]/[33.2mb]}{[old] [5.3gb]->[5.5gb]/[11.6gb]}
2016-07-20T16:30:50.552Z INFO [es-client-16][scheduler][T#1]] [monitor.jvm ]: [es-client-16] [gc][young][70475][2605] duration [721ms], collections [1]/[1s], total [721ms]/[10.8m], memory [5.8gb]->[5.9gb]/[11.9gb], all_pools {[young] [251.5mb]->[148.8mb]/[266.2mb]}{[survivor] [33.2mb]->[33.2mb]/[33.2mb]}{[old] [5.5gb]->[5.7gb]/[11.6gb]}
2016-07-20T16:50:47.720Z INFO [es-client-16][scheduler][T#1]] [monitor.jvm ]: [es-client-16] [gc][young][71670][2650] duration [743ms], collections [1]/[1s], total [743ms]/[10.9m], memory [8.5gb]->[8.8gb]/[11.9gb], all_pools {[young] [75.7mb]->[178.5mb]/[266.2mb]}{[survivor] [1.4mb]->[33.2mb]/[33.2mb]}{[old] [8.4gb]->[8.6gb]/[11.6gb]}
2016-07-20T17:45:34.982Z INFO [es-client-16][scheduler][T#1]] [monitor.jvm ]: [es-client-16] [gc][young][74955][2741] duration [769ms], collections [1]/[1s], total [769ms]/[11.1m], memory [3.1gb]->[3.3gb]/[11.9gb], all_pools {[young] [118.1mb]->[91mb]/[266.2mb]}{[survivor] [33.2mb]->[33.2mb]/[33.2mb]}{[old] [3gb]->[3.2gb]/[11.6gb]}
2016-07-20T17:45:35.983Z INFO [es-client-16][scheduler][T#1]] [monitor.jvm ]: [es-client-16] [gc][young][74956][2742] duration [759ms], collections [1]/[1s], total [759ms]/[11.1m], memory [3.3gb]->[3.5gb]/[11.9gb], all_pools {[young] [91mb]->[48.6mb]/[266.2mb]}{[survivor] [33.2mb]->[33.2mb]/[33.2mb]}{[old] [3.2gb]->[3.4gb]/[11.6gb]}
2016-07-20T17:45:36.986Z INFO [es-client-16][scheduler][T#1]] [monitor.jvm ]: [es-client-16] [gc][young][74957][2743] duration [711ms], collections [1]/[1s], total [711ms]/[11.1m], memory [3.5gb]->[3.7gb]/[11.9gb], all_pools {[young] [48.6mb]->[61.8mb]/[266.2mb]}{[survivor] [33.2mb]->[33.2mb]/[33.2mb]}{[old] [3.4gb]->[3.6gb]/[11.6gb]}
2016-07-20T17:45:37.986Z INFO [es-client-16][scheduler][T#1]] [monitor.jvm ]: [es-client-16] [gc][young][74958][2744] duration [714ms], collections [1]/[1s], total [714ms]/[11.1m], memory [3.7gb]->[3.9gb]/[11.9gb], all_pools {[young] [61.8mb]->[37.1mb]/[266.2mb]}{[survivor] [33.2mb]->[33.2mb]/[33.2mb]}{[old] [3.6gb]->[3.8gb]/[11.6gb]}

[1103] duration [1.3s], collections [1]/[1.5s], total [1.3s]/[53.1s], memory [222.2mb]->[326.5mb]/[11.9gb], all_pools {[young] [96.1mb]->[414kb]/[266.2mb]}{[survivor] [807.1kb]->[33.2mb]/[33.2mb]}{[old] [125.3mb]->[292.9mb]/[11.6gb]}
2016-07-20T12:55:16.819Z WARN [es-client-20][scheduler][T#1]] [monitor.jvm ]: [es-client-20] [gc][young][57706][1130] duration [1.2s], collections [1]/[2.1s], total [1.2s]/[57.5s], memory [1gb]->[1.2gb]/[11.9gb], all_pools {[young] [33.7mb]->[2.1mb]/[266.2mb]}{[survivor] [1.2mb]->[33.2mb]/[33.2mb]}{[old] [1gb]->[1.1gb]/[11.6gb]}
2016-07-20T14:55:43.505Z WARN [es-client-20][scheduler][T#1]] [monitor.jvm ]: [es-client-20] [gc][young][64930][1281] duration [1.2s], collections [1]/[1.2s], total [1.2s]/[1.1m], memory [2.6gb]->[2.5gb]/[11.9gb], all_pools {[young] [241.6mb]->[473.6kb]/[266.2mb]}{[survivor] [29.2mb]->[33.2mb]/[33.2mb]}{[old] [2.3gb]->[2.5gb]/[11.6gb]}
2016-07-20T15:15:18.413Z INFO [es-client-20][scheduler][T#1]] [monitor.jvm ]: [es-client-20] [gc][young][66103][1312] duration [1.5s], collections [2]/[2.4s], total [1.5s]/[1.2m], memory [4.4gb]->[4.3gb]/[11.9gb], all_pools {[young] [261.9mb]->[6.5mb]/[266.2mb]}{[survivor] [1.7mb]->[33.2mb]/[33.2mb]}{[old] [4.1gb]->[4.3gb]/[11.6gb]}