Elasticsearch node fails after multiple G1GC triggers

We have a 6-node cluster on ES 7.17.7 that is very stable except for our single coordinating node (which is also master-eligible, so not a "true" coordinating-only node).

After 4-6 hours of uptime we always hit the issue below on this node, which seems to be related to memory/garbage collection. We have not been able to correlate it with any external activity (large search requests, etc.).

Any thoughts on how to track down the root cause and prevent this issue?
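For context, this is the kind of polling we can run against that node while the heap climbs, watching JVM heap, old-gen GC counters and the parent circuit breaker via the node stats API (a minimal sketch; the host, port and missing auth are placeholders for our setup):

import time
import requests  # assumes the requests package is available

NODE = "elastic-util1-coordinating"      # node name from the logs below
BASE = "http://10.69.5.40:9200"          # placeholder: that node's HTTP endpoint

while True:
    # Node stats filtered to the jvm and breaker sections
    stats = requests.get(f"{BASE}/_nodes/{NODE}/stats/jvm,breaker", timeout=10).json()
    for node in stats["nodes"].values():
        jvm = node["jvm"]
        old_gc = jvm["gc"]["collectors"]["old"]
        parent = node["breakers"]["parent"]
        print(f'{time.strftime("%H:%M:%S")} '
              f'heap={jvm["mem"]["heap_used_percent"]}% '
              f'old_gc_count={old_gc["collection_count"]} '
              f'old_gc_ms={old_gc["collection_time_in_millis"]} '
              f'parent_breaker={parent["estimated_size_in_bytes"]}/{parent["limit_size_in_bytes"]}')
    time.sleep(30)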

[2022-11-12T11:11:54,188][INFO ][o.e.n.Node               ] [elastic-util1-coordinating] started
[2022-11-12T11:21:38,276][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [elastic-util1-coordinating] attempting to trigger G1GC due to high heap usage [4138769936]
[2022-11-12T11:21:38,295][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [elastic-util1-coordinating] GC did bring memory usage down, before [4138769936], after [267450384], allocations [17], duration [18]
[2022-11-12T12:25:26,483][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [elastic-util1-coordinating] attempting to trigger G1GC due to high heap usage [4066731536]
[2022-11-12T12:25:26,507][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [elastic-util1-coordinating] GC did bring memory usage down, before [4066731536], after [363190288], allocations [40], duration [25]
[2022-11-12T12:51:22,809][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [elastic-util1-coordinating] attempting to trigger G1GC due to high heap usage [4127475224]
[2022-11-12T12:51:22,828][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [elastic-util1-coordinating] GC did bring memory usage down, before [4127475224], after [354905104], allocations [29], duration [19]
.............. similar G1GC trigger lines repeat roughly 100 more times ..............
[2022-11-12T14:57:03,626][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [elastic-util1-coordinating] attempting to trigger G1GC due to high heap usage [4081830728]
[2022-11-12T14:57:03,640][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [elastic-util1-coordinating] GC did bring memory usage down, before [4081830728], after [245514296], allocations [18], duration [14]
[2022-11-12T14:57:38,298][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [elastic-util1-coordinating] attempting to trigger G1GC due to high heap usage [4141845696]
[2022-11-12T14:57:38,315][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [elastic-util1-coordinating] GC did bring memory usage down, before [4141845696], after [316259416], allocations [16], duration [17]
[2022-11-12T15:02:17,080][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [elastic-util1-coordinating] attempting to trigger G1GC due to high heap usage [4084792656]
[2022-11-12T15:02:17,099][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [elastic-util1-coordinating] GC did bring memory usage down, before [4084792656], after [269889648], allocations [19], duration [19]
[2022-11-12T15:03:24,567][INFO ][o.e.c.c.Coordinator      ] [elastic-util1-coordinating] [3] consecutive checks of the master node [{elastic0}{GUgiGIKSSKii3RXKZ6DEMw}{d7aJqMRGQYSJ4cLLRHC2Tg}{10.69.5.24}{10.69.5.24:9300}{cdfhimrstw}] were unsuccessful ([0] rejected, [3] timed out), restarting discovery; more details may be available in the master node logs [last unsuccessful check: [elastic0][10.69.5.24:9300][internal:coordination/fault_detection/leader_check] request_id [377597] timed out after [10030ms]]
[2022-11-12T15:03:24,655][INFO ][o.e.c.s.ClusterApplierService] [elastic-util1-coordinating] master node changed {previous [{elastic0}{GUgiGIKSSKii3RXKZ6DEMw}{d7aJqMRGQYSJ4cLLRHC2Tg}{10.69.5.24}{10.69.5.24:9300}{cdfhimrstw}], current []}, term: 312, version: 860457, reason: becoming candidate: onLeaderFailure
[2022-11-12T15:03:34,656][WARN ][o.e.c.c.ClusterFormationFailureHelper] [elastic-util1-coordinating] master not discovered or elected yet, an election requires at least 2 nodes with ids from [b7Kj8A4zRJC9tAYRlSCdSg, GUgiGIKSSKii3RXKZ6DEMw, 40EjWfPPQbOspToxtuUzEg], have only discovered non-quorum [{elastic-util1-coordinating}{40EjWfPPQbOspToxtuUzEg}{iCWCzpgGQHOyfoDaaV0t4Q}{10.69.5.40}{10.69.5.40:9300}{m}]; discovery will continue using [10.69.5.24:9300, 10.69.5.20:9300] from hosts providers and [{elastic-util1-coordinating}{40EjWfPPQbOspToxtuUzEg}{iCWCzpgGQHOyfoDaaV0t4Q}{10.69.5.40}{10.69.5.40:9300}{m}, {elastic0}{GUgiGIKSSKii3RXKZ6DEMw}{d7aJqMRGQYSJ4cLLRHC2Tg}{10.69.5.24}{10.69.5.24:9300}{cdfhimrstw}, {elastic1}{b7Kj8A4zRJC9tAYRlSCdSg}{AE3dEw3rSmazTv0dFgYFJA}{10.69.5.20}{10.69.5.20:9300}{cdfhimrstw}] from last-known cluster state; node term 312, last-accepted version 860457 in term 312
[2022-11-12T15:03:44,659][WARN ][o.e.c.c.ClusterFormationFailureHelper] [elastic-util1-coordinating] master not discovered or elected yet, an election requires at least 2 nodes with ids from [b7Kj8A4zRJC9tAYRlSCdSg, GUgiGIKSSKii3RXKZ6DEMw, 40EjWfPPQbOspToxtuUzEg], have only discovered non-quorum [{elastic-util1-coordinating}{40EjWfPPQbOspToxtuUzEg}{iCWCzpgGQHOyfoDaaV0t4Q}{10.69.5.40}{10.69.5.40:9300}{m}]; discovery will continue using [10.69.5.24:9300, 10.69.5.20:9300] from hosts providers and [{elastic-util1-coordinating}{40EjWfPPQbOspToxtuUzEg}{iCWCzpgGQHOyfoDaaV0t4Q}{10.69.5.40}{10.69.5.40:9300}{m}, {elastic0}{GUgiGIKSSKii3RXKZ6DEMw}{d7aJqMRGQYSJ4cLLRHC2Tg}{10.69.5.24}{10.69.5.24:9300}{cdfhimrstw}, {elastic1}{b7Kj8A4zRJC9tAYRlSCdSg}{AE3dEw3rSmazTv0dFgYFJA}{10.69.5.20}{10.69.5.20:9300}{cdfhimrstw}] from last-known cluster state; node term 312, last-accepted version 860457 in term 312
.... these "master not discovered" lines repeat indefinitely (every ~10 seconds), and the node never re-discovers the master ....
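The next time it happens we also plan to grab hot threads from the stuck node, since the hot threads API is served by the node itself and should still respond even while it cannot see a master; roughly (endpoint is again a placeholder):

import datetime
import requests  # assumes the requests package is available

BASE = "http://10.69.5.40:9200"   # placeholder: the coordinating node's HTTP endpoint

# Ask the node for all of its hot threads and save the plain-text response
resp = requests.get(f"{BASE}/_nodes/elastic-util1-coordinating/hot_threads",
                    params={"threads": 9999}, timeout=30)
fname = f"hot_threads_{datetime.datetime.now():%Y%m%dT%H%M%S}.txt"
with open(fname, "w") as f:
    f.write(resp.text)
print(f"wrote {fname}")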

Here are my cluster stats, first after restarting the coordinating node and then while the coordinating node had crashed/left the cluster (screenshots omitted here).
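To make those stats easier to compare between the healthy and failed states, the same numbers can also be pulled from the API rather than screenshots, for example (node address is a placeholder):

import requests  # assumes the requests package is available

BASE = "http://10.69.5.24:9200"   # placeholder: any reachable node's HTTP endpoint

# Per-node heap, roles and current master -- handy to diff before/after the failure
print(requests.get(f"{BASE}/_cat/nodes",
                   params={"v": "true",
                           "h": "name,node.role,master,heap.percent,ram.percent,uptime"},
                   timeout=10).text)

# Cluster-wide summary (node counts, total JVM heap, shard counts, etc.)
print(requests.get(f"{BASE}/_cluster/stats", params={"human": "true"}, timeout=10).text)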
