GC allocation failure causing shard to fail

In the Elasticsearch log on node-2 I can see node-3 being removed from and re-added to the cluster multiple times.
At around the same time, I see long GC cycles on node-3.

[2020-01-17T01:33:49,904][INFO ][o.e.c.s.ClusterApplierService] [node-2] removed {{node-3}{5t-Y05xkTU2TEAWn126Rwg}{8lRQU65YRoK1AkZ1P299dQ}{node-3}{9.151.141.3:9300}{ml.machine_memory=1097878700032, ml.max_open_jobs=20, xpack.installed=true},}, term: 290, version: 3285, reason: ApplyCommitRequest{term=290, version=3285, sourceNode={node-1}{-_UntR0aSFee7J0uGBsXvw}{NgPgPx4nRSGsuxenifYBFg}{node-1}{9.151.141.1:9300}{ml.machine_memory=1097878695936, ml.max_open_jobs=20, xpack.installed=true}}

[2020-01-17T01:33:59.708+0000][4005][gc           ] GC(38) Pause Young (Allocation Failure) 216M->79M(494M) 27889.991ms



[2020-01-17T01:43:50,918][INFO ][o.e.c.s.ClusterApplierService] [node-2] removed {{node-3}{5t-Y05xkTU2TEAWn126Rwg}{8lRQU65YRoK1AkZ1P299dQ}{node-3}{9.151.141.3:9300}{ml.machine_memory=1097878700032, ml.max_open_jobs=20, xpack.installed=true},}, term: 290, version: 3292, reason: ApplyCommitRequest{term=290, version=3292, sourceNode={node-1}{-_UntR0aSFee7J0uGBsXvw}{NgPgPx4nRSGsuxenifYBFg}{node-1}{9.151.141.1:9300}{ml.machine_memory=1097878695936, ml.max_open_jobs=20, xpack.installed=true}}

[2020-01-17T01:43:51.894+0000][4005][gc           ] GC(41) Pause Young (Allocation Failure) 219M->84M(494M) 18547.871ms



[2020-01-17T02:09:00,608][INFO ][o.e.c.s.ClusterApplierService] [node-2] removed {{node-3}{5t-Y05xkTU2TEAWn126Rwg}{8lRQU65YRoK1AkZ1P299dQ}{node-3}{9.151.141.3:9300}{ml.machine_memory=1097878700032, ml.max_open_jobs=20, xpack.installed=true},}, term: 290, version: 3315, reason: ApplyCommitRequest{term=290, version=3315, sourceNode={node-1}{-_UntR0aSFee7J0uGBsXvw}{NgPgPx4nRSGsuxenifYBFg}{node-1}{9.151.141.1:9300}{ml.machine_memory=1097878695936, ml.max_open_jobs=20, xpack.installed=true}}


[2020-01-17T02:10:36.413+0000][4005][gc           ] GC(47) Pause Young (Allocation Failure) 218M->82M(494M) 112874.834ms


[2020-01-17T02:20:34,991][INFO ][o.e.c.s.ClusterApplierService] [node-2] removed {{node-3}{5t-Y05xkTU2TEAWn126Rwg}{8lRQU65YRoK1AkZ1P299dQ}{node-3}{9.151.141.3:9300}{ml.machine_memory=1097878700032, ml.max_open_jobs=20, xpack.installed=true},}, term: 290, version: 3322, reason: ApplyCommitRequest{term=290, version=3322, sourceNode={node-1}{-_UntR0aSFee7J0uGBsXvw}{NgPgPx4nRSGsuxenifYBFg}{node-1}{9.151.141.1:9300}{ml.machine_memory=1097878695936, ml.max_open_jobs=20, xpack.installed=true}}

[2020-01-17T02:22:26.047+0000][4005][gc           ] GC(50) Pause Young (Allocation Failure) 219M->88M(494M) 128482.688ms

In addition, after these long GC pauses, a shard was marked as failed on node-3:

[2020-01-17T01:44:08,502][WARN ][o.e.i.c.IndicesClusterStateService] [node-3] [events_1578091228978][0] marking and sending shard failed due to [failed to create shard]
java.io.IOException: failed to obtain in-memory shard lock
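
For reference, here is a minimal sketch of how this failed shard could be inspected and retried once node-3 is stable again. The host/port (localhost:9200) is an assumption; the index name and shard number are taken from the log line above, and "primary": true is an assumption rather than something the log confirms.

# Ask the cluster why this shard is failed/unassigned (allocation explain API)
curl -s -H 'Content-Type: application/json' 'localhost:9200/_cluster/allocation/explain?pretty' -d '{"index":"events_1578091228978","shard":0,"primary":true}'

# Retry shards that exhausted their allocation attempts after transient failures
curl -s -X POST 'localhost:9200/_cluster/reroute?retry_failed=true&pretty'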

This problem did not occur before with the exact same settings.
In addition, the same young GC cycles completed very quickly both before and after the errors:

[2020-01-17T01:29:02.898+0000][4005][gc           ] GC(37) Pause Young (Allocation Failure) 215M->79M(494M) 1.181ms
[2020-01-17T01:29:02.898+0000][4005][gc,cpu       ] GC(37) User=0.01s Sys=0.00s Real=0.00s


[2020-01-17T01:39:41.745+0000][4005][gc           ] GC(40) Pause Young (Allocation Failure) 218M->82M(494M) 1.560ms
[2020-01-17T01:39:41.745+0000][4005][gc,cpu       ] GC(40) User=0.01s Sys=0.00s Real=0.00s


[2020-01-17T02:03:51.473+0000][4005][gc           ] GC(46) Pause Young (Allocation Failure) 218M->82M(494M) 1.380ms
[2020-01-17T02:03:51.473+0000][4005][gc,cpu       ] GC(46) User=0.01s Sys=0.01s Real=0.00s


[2020-01-17T02:16:20.786+0000][4005][gc           ] GC(49) Pause Young (Allocation Failure) 218M->82M(494M) 1.377ms
[2020-01-17T02:16:20.786+0000][4005][gc,cpu       ] GC(49) User=0.01s Sys=0.00s Real=0.00s
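
For context, these GC lines are JDK unified logging output. A sketch of the jvm.options directive that typically produces this format on JDK 9+ follows (the file path and rotation settings here are assumptions, not taken from this cluster); having safepoint logging enabled as well helps show whether a long pause is actual GC work or time spent bringing threads to a safepoint.

# jvm.options (JDK 9+): GC, tenuring-age, and safepoint logging with UTC timestamps, pid and tags
9-:-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m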

What can I do? (Assume the heap size cannot be changed.)
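
A young collection on a ~500 MB heap normally takes a few milliseconds (as in the fast samples above), so multi-second real-time pauses usually point at the JVM being stalled by the operating system (for example swapping) rather than at the heap itself. A minimal sketch of checks that do not require changing the heap; the host/port and shell access on node-3 are assumptions, and whether swapping or memory locking is actually involved here is not confirmed by the logs above:

# Is the heap locked into RAM (bootstrap.memory_lock)?
curl -s 'localhost:9200/_nodes?filter_path=**.mlockall&pretty'

# Watch for swap-in/swap-out activity on node-3 while a long pause is happening
vmstat 1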
