Debug Old GC after startup and on indexing (40 second hangs every other minute)


For a few days now we have been experiencing huge GC issues with Elasticsearch. When indexing new documents, or sometimes even right after startup, we get old-gen GC pauses of around 40 seconds. This happens about once a minute, effectively shutting down ES. The GC does not seem to be able to reclaim the memory, though, so I assume the heap is actually in use.

[2016-02-04 07:42:54,595][WARN ][monitor.jvm              ] [Zarek] [gc][old][172180][11511] duration [36.2s], collections [1]/[37.4s], total [36.2s]/[3h], memory [12.9gb]->[12.1gb]/[13.8gb], all_pools {[young] [861.2mb]->[44.2mb]/[1.4gb]}{[survivor] [118.6mb]->[0b]/[191.3mb]}{[old] [12gb]->[12gb]/[12.1gb]}
[2016-02-04 07:43:28,502][WARN ][monitor.jvm              ] [Zarek] [gc][old][172181][11512] duration [33.6s], collections [1]/[33.9s], total [33.6s]/[3h], memory [12.1gb]->[12.1gb]/[13.8gb], all_pools {[young] [44.2mb]->[38.6mb]/[1.4gb]}{[survivor] [0b]->[0b]/[191.3mb]}{[old] [12gb]->[12.1gb]/[12.1gb]}
[2016-02-04 07:44:02,924][WARN ][monitor.jvm              ] [Zarek] [gc][old][172182][11513] duration [34.1s], collections [1]/[34.4s], total [34.1s]/[3h], memory [12.1gb]->[12.3gb]/[13.8gb], all_pools {[young] [38.6mb]->[271.1mb]/[1.4gb]}{[survivor] [0b]->[0b]/[191.3mb]}{[old] [12.1gb]->[12.1gb]/[12.1gb]}

It is a single node with 32 cores, 28GB total memory, and a 14GB heap. When I check the cluster stats, the memory used by caches is only about 100MB (id), and the only other notable memory consumer is segments, which is listed as "3074610792".
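For reference, that segments figure is reported in bytes, so a quick conversion (a minimal sketch, using the raw value from the cluster stats above) shows it is nowhere near the full heap:

```shell
# _cluster/stats reports segment memory in bytes.
# 3074610792 bytes is roughly 2.86GB -- far less than the 14GB heap,
# so segments alone cannot explain the ~12GB old gen usage in the logs.
awk 'BEGIN { printf "%.2f GB\n", 3074610792 / (1024 * 1024 * 1024) }'
```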

The memory already seems to be filling up when the node starts. When we index, we do so at quite a high rate: up to 200MB/s is written to disk for longer periods. In most cases this runs without a hiccup for over an hour, but sometimes the GC pauses shown above happen.

Does anyone have an idea what this could be and how it can be debugged further?