I'm seeing a strange pattern in my cluster. Every thirty minutes there is a spike in response times. Matching the times against the GC logs, I can see that the slow responses correlate with long young GCs. Often there are several GCs in a row:
[2015-07-30 16:05:34,767][INFO ][monitor.jvm ] [bazaar-polloi-agrippa-ironmand-elasticsearch-31a7a4e2] [gc][young][574478][514018] duration [904ms], collections [1]/[1.3s], total [904ms]/[4.6h], memory [17.8gb]->[18.2gb]/[28gb], all_pools {[young] [221.9kb]->[7.3mb]/[1.4gb]}{[survivor] [127.7mb]->[191.3mb]/[191.3mb]}{[old] [17.6gb]->[18gb]/[26.3gb]}
[2015-07-30 16:05:35,787][INFO ][monitor.jvm ] [bazaar-polloi-agrippa-ironmand-elasticsearch-31a7a4e2] [gc][young][574479][514019] duration [891ms], collections [1]/[1s], total [891ms]/[4.6h], memory [18.2gb]->[18.6gb]/[28gb], all_pools {[young] [7.3mb]->[88.9mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [18gb]->[18.3gb]/[26.3gb]}
[2015-07-30 16:05:39,494][WARN ][monitor.jvm ] [bazaar-polloi-agrippa-ironmand-elasticsearch-31a7a4e2] [gc][young][574482][514024] duration [1.5s], collections [1]/[1.7s], total [1.5s]/[4.6h], memory [19.3gb]->[19.6gb]/[28gb], all_pools {[young] [313.1mb]->[16.6mb]/[1.4gb]}{[survivor] [87.1mb]->[191.3mb]/[191.3mb]}{[old] [18.9gb]->[19.4gb]/[26.3gb]}
[2015-07-30 16:05:40,495][INFO ][monitor.jvm ] [bazaar-polloi-agrippa-ironmand-elasticsearch-31a7a4e2] [gc][young][574483][514025] duration [804ms], collections [1]/[1s], total [804ms]/[4.6h], memory [19.6gb]->[20.4gb]/[28gb], all_pools {[young] [16.6mb]->[705.7mb]/[1.4gb]}{[survivor] [191.3mb]->[99.3mb]/[191.3mb]}{[old] [19.4gb]->[19.6gb]/[26.3gb]}
[2015-07-30 16:05:45,106][WARN ][monitor.jvm ] [bazaar-polloi-agrippa-ironmand-elasticsearch-31a7a4e2] [gc][young][574486][514032] duration [2.1s], collections [1]/[2.6s], total [2.1s]/[4.6h], memory [19.8gb]->[20.5gb]/[28gb], all_pools {[young] [140.8mb]->[3.6mb]/[1.4gb]}{[survivor] [80mb]->[191.3mb]/[191.3mb]}{[old] [19.6gb]->[20.3gb]/[26.3gb]}
[2015-07-30 16:05:46,106][INFO ][monitor.jvm ] [bazaar-polloi-agrippa-ironmand-elasticsearch-31a7a4e2] [gc][young][574487][514033] duration [763ms], collections [1]/[1s], total [763ms]/[4.6h], memory [20.5gb]->[21.6gb]/[28gb], all_pools {[young] [3.6mb]->[950.7mb]/[1.4gb]}{[survivor] [191.3mb]->[191.3mb]/[191.3mb]}{[old] [20.3gb]->[20.5gb]/[26.3gb]}
[2015-07-30 16:35:37,196][WARN ][monitor.jvm ] [bazaar-polloi-agrippa-ironmand-elasticsearch-31a7a4e2] [gc][young][576274][516618] duration [1.7s], collections [1]/[2s], total [1.7s]/[4.6h], memory [12.6gb]->[13.4gb]/[28gb], all_pools {[young] [813.5kb]->[18.9mb]/[1.4gb]}{[survivor] [107.6mb]->[191.3mb]/[191.3mb]}{[old] [12.5gb]->[13.2gb]/[26.3gb]}
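To confirm the periodicity, I've been pulling the young-GC durations out of these monitor.jvm lines with a small script, roughly like the sketch below. The regex just assumes the log format shown above, and the minute-of-hour bucketing is my own quick way of checking whether the slow collections really pile up at :05 and :35:

```python
import re
import sys
from collections import defaultdict

# Matches the monitor.jvm young-GC lines shown above, e.g.
# [2015-07-30 16:05:34,767][INFO ][monitor.jvm ] [...] [gc][young][...][...] duration [904ms], ...
LINE_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+\]"
    r".*\[gc\]\[young\].*duration \[(?P<dur>[\d.]+)(?P<unit>ms|s)\]"
)

def duration_ms(value, unit):
    return float(value) * (1000 if unit == "s" else 1)

buckets = defaultdict(list)  # minute of the hour -> list of GC durations in ms

for line in sys.stdin:
    m = LINE_RE.search(line)
    if not m:
        continue
    ms = duration_ms(m.group("dur"), m.group("unit"))
    minute = int(m.group("ts")[14:16])  # minute-of-hour from the timestamp
    buckets[minute].append(ms)

# Total and worst young-GC time per minute of the hour; the spikes should
# show up as outliers at :05 and :35 if the pattern is as regular as it looks.
for minute in sorted(buckets):
    durations = buckets[minute]
    print("minute :%02d  count=%4d  total=%.1fs  max=%.0fms" % (
        minute, len(durations), sum(durations) / 1000, max(durations)))
```

Feeding it a few days of logs (`cat elasticsearch.log | python gc_buckets.py`) shows exactly two hot minutes per hour on every data node.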
My cluster has 3 master nodes, 5 client nodes, and 27 data nodes. Each data node has a 28 GB heap. The cluster is running Elasticsearch 1.3 on OpenJDK 1.8.
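For what it's worth, here's roughly how I've been double-checking the heap and pool sizes via the node stats API (just a sketch; the localhost:9200 address is a placeholder for one of the client nodes, and I'm only assuming the standard `_nodes/stats/jvm` fields):

```python
import json
import urllib.request

# Placeholder address for one of the client nodes; adjust for your cluster.
ES_URL = "http://localhost:9200"

with urllib.request.urlopen(ES_URL + "/_nodes/stats/jvm") as resp:
    stats = json.load(resp)

GB = 1024 ** 3
for node_id, node in stats["nodes"].items():
    mem = node["jvm"]["mem"]
    young = node["jvm"]["gc"]["collectors"]["young"]
    # Heap usage plus cumulative young-GC count/time per node.
    print("%s  heap %d/%d GB  young GCs: %d (%.1f h total)" % (
        node["name"],
        mem["heap_used_in_bytes"] // GB,
        mem["heap_max_in_bytes"] // GB,
        young["collection_count"],
        young["collection_time_in_millis"] / 3600000.0,
    ))
    for pool, p in mem["pools"].items():
        print("  %-8s %4d MB used / %4d MB max" % (
            pool, p["used_in_bytes"] // 2**20, p["max_in_bytes"] // 2**20))
```

The numbers match the log lines above: a 1.4 GB young generation and a ~191 MB survivor space against the 28 GB heap.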
Here's what confuses me:
- Why would a young GC take more than a second?
- Why would there be so many GCs in a row?
- Is there anything internal to ES that might cause this on such a regular basis? I stopped indexing for an hour and still saw the spikes, and our queries are so irregular that I can't imagine any pattern there causing this. It happens every thirty minutes, at 5 and 35 past the hour, and has done so for weeks. (See the sketch after this list for how I'm trying to catch what the nodes are doing at those times.)
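Here's the sketch mentioned above: it just watches the clock and grabs a hot_threads snapshot from the cluster at :05 and :35 so I can see what's actually running during a spike. The host, output file names, and thread count are placeholders; `_nodes/hot_threads` is the standard hot threads endpoint:

```python
import datetime
import time
import urllib.request

# Placeholder address for one of the client nodes; adjust for your cluster.
ES_URL = "http://localhost:9200"
TRIGGER_MINUTES = {5, 35}  # the spikes land at :05 and :35

def dump_hot_threads():
    """Grab a hot_threads snapshot from every node and write it to a timestamped file."""
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    with urllib.request.urlopen(ES_URL + "/_nodes/hot_threads?threads=5") as resp:
        body = resp.read().decode("utf-8")
    with open("hot_threads-%s.txt" % stamp, "w") as f:
        f.write(body)

while True:
    if datetime.datetime.now().minute in TRIGGER_MINUTES:
        dump_hot_threads()
        time.sleep(60)  # don't fire twice in the same minute
    time.sleep(5)
```

I'll post the snapshots from the next spike if they show anything interesting.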