Hi,
We have a v6.3 cluster (Java 1.8), and from time to time we see a problem with GC on the data nodes.
The GC rate drops to zero and the JVM heap usage climbs to 100%, which causes the cluster to stop serving requests.
Each data node has 30 GB of RAM, and half of it is allocated to the heap.
Node configuration:
indices.fielddata.cache.size: 10%
indices.memory.index_buffer_size: 30%
indices.queries.cache.size: 30%
indices.requests.cache.size: 5%
indices.queries.cache.count: 5000000
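To put these percentages in context, here is a rough sketch of the upper bounds they imply. It assumes each percentage is taken against a nominal 15 GB heap (half of the 30 GB of RAM); the names `HEAP_GB` and `budget_gb` are just illustrative, and the count-based setting is left out because it is not a size.

```python
# Rough upper bounds for the percentage-based settings above.
# Assumption: every percentage is relative to a nominal 15 GB heap.
HEAP_GB = 15.0

settings_pct = {
    "indices.fielddata.cache.size": 10,
    "indices.memory.index_buffer_size": 30,
    "indices.queries.cache.size": 30,
    "indices.requests.cache.size": 5,
}


def budget_gb(pct, heap_gb=HEAP_GB):
    """Convert a percentage of the heap into gigabytes."""
    return heap_gb * pct / 100.0


total_pct = sum(settings_pct.values())

for name, pct in settings_pct.items():
    print(f"{name}: {pct}% -> {budget_gb(pct):.2f} GB")
print(f"total: {total_pct}% -> {budget_gb(total_pct):.2f} GB of {HEAP_GB:.0f} GB heap")
```

These are limits rather than measured usage, so the sum only shows how much of the heap the caches and the index buffer are allowed to claim if they all fill up.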
GC logs:
2018-11-21T06:20:30.382+0000: 74173.669: [CMS-concurrent-mark-start]
2018-11-21T06:20:32.516+0000: 74175.803: [CMS-concurrent-mark: 2.134/2.134 secs] [Times: user=9.62 sys=0.01, real=2.14 secs]
2018-11-21T06:20:32.516+0000: 74175.804: [CMS-concurrent-preclean-start]
2018-11-21T06:20:37.085+0000: 74180.372: [Full GC (Allocation Failure) 2018-11-21T06:20:37.085+0000: 74180.372: [CMS2018-11-21T06:20:37.360+0000: 74180.647: [CMS-concurrent-preclean: 4.842/4.844 secs] [Times: user=7.17 sys=0.05, real=4.84 secs]
(concurrent mode failure): 14326205K->14326207K(14326208K), 15.0557023 secs] 15323005K->14437097K(15323008K), [Metaspace: 84592K->84592K(1130496K)], 15.0559302 secs] [Times: user=15.07 sys=0.00, real=15.06 secs]
2018-11-21T06:20:52.141+0000: 74195.428: Total time for which application threads were stopped: 15.0567299 seconds, Stopping threads took: 0.0001486 seconds
2018-11-21T06:20:54.141+0000: 74197.428: [GC (CMS Initial Mark) [1 CMS-initial-mark: 14326207K(14326208K)] 14749195K(15323008K), 0.0340097 secs] [Times: user=0.34 sys=0.00, real=0.03 secs]
2018-11-21T06:20:54.175+0000: 74197.463: Total time for which application threads were stopped: 0.0348284 seconds, Stopping threads took: 0.0001442 seconds
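To make the Full GC entry easier to read, here is a small sketch that pulls the before/after/capacity numbers out of it. It assumes the usual CMS log layout in which the first triple is the old generation, the second is the whole heap and the third is Metaspace; the interleaved CMS-concurrent-preclean fragment is left out of the sample string, and the names `TRIPLE` and `occupancy` are illustrative only.

```python
import re

# The Full GC entry from the log above, without the interleaved
# CMS-concurrent-preclean fragment.
LINE = (
    "(concurrent mode failure): 14326205K->14326207K(14326208K), 15.0557023 secs] "
    "15323005K->14437097K(15323008K), "
    "[Metaspace: 84592K->84592K(1130496K)], 15.0559302 secs]"
)

# Matches "<before>K-><after>K(<capacity>K)".
TRIPLE = re.compile(r"(\d+)K->(\d+)K\((\d+)K\)")


def occupancy(line):
    """Return (before_kb, after_kb, capacity_kb) tuples in order of appearance."""
    return [tuple(int(x) for x in m.groups()) for m in TRIPLE.finditer(line)]


# Assumption: the triples appear in the order old generation, heap, Metaspace.
regions = dict(zip(["old_gen", "heap", "metaspace"], occupancy(LINE)))

for name, (before, after, cap) in regions.items():
    print(f"{name}: {before}K -> {after}K of {cap}K "
          f"({100.0 * after / cap:.1f}% used after, {before - after}K freed)")
```

Read this way, the 15-second stop-the-world collection left the old generation one kilobyte short of its capacity, so the old generation gained essentially nothing from the pause.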
JVM heap usage / GC rate graphs: