Node killed after GC doing a Pause Full (G1 Evacuation Pause)

Hello,
We have noticed the following: after the JVM performs a Pause Full (G1 Evacuation Pause), the memory is released, but the node cannot get back into the cluster without a restart of the ES service. The server still replied on port 9200, but the ES part was not replying. The GC logs are attached.
The cluster has 4 nodes. Each node has 504 shards.

Running on ES 7.13.2, Ubuntu 20, JVM 16 (default JDK).

-Xms30g
-Xmx30g

14-:-XX:+UseG1GC
14-:-XX:G1ReservePercent=25
14-:-XX:InitiatingHeapOccupancyPercent=30

The node has 64 GB of RAM and SSD storage.

[2021-07-08T12:03:31.793+0000][1086][gc,start      ] GC(141124) Pause Full (G1 Evacuation Pause)
[2021-07-08T12:03:31.802+0000][1086][gc,phases,start] GC(141124) Phase 1: Mark live objects
[2021-07-08T12:03:31.996+0000][1086][gc,phases      ] GC(141124) Phase 1: Mark live objects 193.883ms
[2021-07-08T12:03:31.996+0000][1086][gc,phases,start] GC(141124) Phase 2: Prepare for compaction
[2021-07-08T12:03:32.067+0000][1086][gc,phases      ] GC(141124) Phase 2: Prepare for compaction 71.152ms
[2021-07-08T12:03:32.067+0000][1086][gc,phases,start] GC(141124) Phase 3: Adjust pointers
[2021-07-08T12:03:32.168+0000][1086][gc,phases      ] GC(141124) Phase 3: Adjust pointers 101.137ms
[2021-07-08T12:03:32.168+0000][1086][gc,phases,start] GC(141124) Phase 4: Compact heap
[2021-07-08T12:03:32.662+0000][1086][gc,phases      ] GC(141124) Phase 4: Compact heap 493.110ms
[2021-07-08T12:03:32.705+0000][1086][gc,heap        ] GC(141124) Eden regions: 0->0(693)
[2021-07-08T12:03:32.705+0000][1086][gc,heap        ] GC(141124) Survivor regions: 0->0(0)
[2021-07-08T12:03:32.705+0000][1086][gc,heap        ] GC(141124) Old regions: 1860->682
[2021-07-08T12:03:32.705+0000][1086][gc,heap        ] GC(141124) Archive regions: 2->2
[2021-07-08T12:03:32.705+0000][1086][gc,heap        ] GC(141124) Humongous regions: 58->37
[2021-07-08T12:03:32.705+0000][1086][gc,metaspace   ] GC(141124) Metaspace: 124071K(125568K)->124068K(125568K) NonClass: 108738K(109568K)->108735K(109568K) Class: 15333K(16000K)->15332K(16000K)
[2021-07-08T12:03:32.705+0000][1086][gc             ] GC(141124) Pause Full (G1 Evacuation Pause) 30392M->11255M(30720M) 912.185ms
[2021-07-08T12:03:32.705+0000][1086][gc,cpu         ] GC(141124) User=17.70s Sys=0.06s Real=0.93s
[2021-07-08T12:03:32.705+0000][1086][safepoint      ] Safepoint "G1CollectForAllocation", Time since last: 1801792 ns, Reaching safepoint: 421106 ns, At safepoint: 945002207 ns, Total: 945423313 ns
[2021-07-08T12:03:32.705+0000][1086][gc,marking     ] GC(141120) Concurrent Rebuild Remembered Sets 1786.856ms
[2021-07-08T12:03:32.705+0000][1086][gc,marking     ] GC(141120) Concurrent Mark Abort
[2021-07-08T12:03:32.705+0000][1086][gc             ] GC(141120) Concurrent Mark Cycle 3248.726ms
[2021-07-08T12:03:32.749+0000][1086][safepoint      ] Safepoint "ICBufferFull", Time since last: 43352922 ns, Reaching safepoint: 744754 ns, At safepoint: 17056 ns, Total: 761810 ns
[2021-07-08T12:03:33.317+0000][1086][safepoint      ] Safepoint "ICBufferFull", Time since last: 567216659 ns, Reaching safepoint: 283036 ns, At safepoint: 16358 ns, Total: 299394 ns
[2021-07-08T12:03:33.317+0000][1086][safepoint      ] Safepoint "ICBufferFull", Time since last: 167361 ns, Reaching safepoint: 144589 ns, At safepoint: 27664 ns, Total: 172253 ns
[2021-07-08T12:03:33.317+0000][1086][safepoint      ] Safepoint "ICBufferFull", Time since last: 51763 ns, Reaching safepoint: 109378 ns, At safepoint: 13468 ns, Total: 122846 ns
[2021-07-08T12:03:33.317+0000][1086][safepoint      ] Safepoint "ICBufferFull", Time since last: 40818 ns, Reaching safepoint: 187398 ns, At safepoint: 14352 ns, Total: 201750 ns
[2021-07-08T12:03:33.375+0000][1086][gc,heap,exit   ] Heap
[2021-07-08T12:03:33.375+0000][1086][gc,heap,exit   ]  garbage-first heap   total 31457280K, used 12279144K [0x0000000080000000, 0x0000000800000000)
[2021-07-08T12:03:33.375+0000][1086][gc,heap,exit   ]   region size 16384K, 47 young (770048K), 0 survivors (0K)
[2021-07-08T12:03:33.375+0000][1086][gc,heap,exit   ]  Metaspace       used 124164K, committed 125632K, reserved 1163264K
[2021-07-08T12:03:33.375+0000][1086][gc,heap,exit   ]   class space    used 15349K, committed 16000K, reserved 1048576K
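
To make the numbers in the log above easier to read, here is a minimal Python sketch that parses the pause summary line and cross-checks it against the region counts. The regular expression and the helper names (`parse_pause`, `total_regions`) are my own illustration, not part of any Elasticsearch or JDK tooling, and the pattern is only assumed to fit lines shaped like the one shown here.

```python
import re

# Summary line copied verbatim from the GC log above.
SUMMARY = (
    "[2021-07-08T12:03:32.705+0000][1086][gc             ] GC(141124) "
    "Pause Full (G1 Evacuation Pause) 30392M->11255M(30720M) 912.185ms"
)

# Matches "<before>M-><after>M(<capacity>M) <pause>ms" at the end of a pause line.
# Hypothetical pattern written for this log format only.
PAUSE_RE = re.compile(r"(\d+)M->(\d+)M\((\d+)M\)\s+([\d.]+)ms$")


def parse_pause(line):
    """Return heap before/after/capacity in MB and the pause in ms, or None."""
    m = PAUSE_RE.search(line)
    if m is None:
        return None
    before, after, capacity = (int(g) for g in m.group(1, 2, 3))
    return {
        "before_mb": before,
        "after_mb": after,
        "capacity_mb": capacity,
        "pause_ms": float(m.group(4)),
    }


def total_regions(capacity_mb, region_size_kb):
    """Number of G1 regions for a given heap capacity and region size."""
    return capacity_mb * 1024 // region_size_kb


pause = parse_pause(SUMMARY)
# Region size (16384K) is taken from the "region size" line in the heap-exit output.
regions = total_regions(pause["capacity_mb"], 16384)
# Old + Archive + Humongous regions reported before the full GC.
occupied_before = 1860 + 2 + 58

print(pause)
print("total regions:", regions)
print("occupied before full GC:", occupied_before)
print("heap used before: {:.1%}".format(pause["before_mb"] / pause["capacity_mb"]))
print("heap used after:  {:.1%}".format(pause["after_mb"] / pause["capacity_mb"]))
```

If the arithmetic holds, the old, archive and humongous regions together account for every region in the 30 GB heap right before the full GC, with no eden or survivor regions left, which matches the 30392M of 30720M reported in the summary line.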

Can you also share log files showing how the node tries to join back into the cluster and fails? That would help tremendously with debugging!
