Heap suddenly full on all voting-only master nodes (7.17.28)

I have no idea what happened, and there is no clue in the node logs.
These nodes have the roles data_cold,master,voting_only and they are normally very idle, with GC happening only every three hours or so.

Suddenly the heap on both nodes filled up and they crashed.

Logs of the first node, 2025-06-25

Elasticsearch:

[2025-06-25T19:21:59,987][WARN ][o.e.m.j.JvmGcMonitorService] [COLD1] [gc][1991139] overhead, spent [5.2s] collecting in the last [5.4s]
[2025-06-25T19:22:00,556][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] attempting to trigger G1GC due to high heap usage [31975780480]
[2025-06-25T19:22:00,679][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] attempt to trigger young GC failed to bring memory down, triggering full GC
[2025-06-25T19:22:00,936][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] GC did not bring memory usage down, before [31975780480], after [32002497128], allocations [68], duration [380]
[2025-06-25T19:22:01,013][INFO ][o.e.m.j.JvmGcMonitorService] [COLD1] [gc][1991140] overhead, spent [271ms] collecting in the last [1s]
[2025-06-25T19:22:08,809][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] attempting to trigger G1GC due to high heap usage [33109914656]
[2025-06-25T19:22:08,809][WARN ][o.e.t.ThreadPool         ] [COLD1] timer thread slept for [6.6s/6672ms] on absolute clock which is above the warn threshold of [5000ms]
[2025-06-25T19:22:08,820][WARN ][o.e.t.ThreadPool         ] [COLD1] timer thread slept for [6.6s/6672162398ns] on relative clock which is above the warn threshold of [5000ms]
[2025-06-25T19:22:08,821][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] attempt to trigger young GC failed to bring memory down, triggering full GC
[2025-06-25T19:22:09,085][WARN ][o.e.m.j.JvmGcMonitorService] [COLD1] [gc][1991142] overhead, spent [6.5s] collecting in the last [6.7s]
[2025-06-25T19:22:09,086][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] GC did bring memory usage down, before [33109914656], after [33106726224], allocations [1], duration [276]
[2025-06-25T19:22:16,298][WARN ][o.e.t.ThreadPool         ] [COLD1] timer thread slept for [6.7s/6738ms] on absolute clock which is above the warn threshold of [5000ms]
[2025-06-25T19:22:16,299][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] attempting to trigger G1GC due to high heap usage [33261247848]
[2025-06-25T19:22:16,300][WARN ][o.e.t.ThreadPool         ] [COLD1] timer thread slept for [6.7s/6738267801ns] on relative clock which is above the warn threshold of [5000ms]
[2025-06-25T19:22:37,672][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] could not acquire lock within 500ms when attempting to trigger G1GC due to high heap usage
[2025-06-25T19:22:37,674][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] memory usage not down after [0], before [33261247848], after [33261247848]
[2025-06-25T19:22:37,675][WARN ][o.e.m.j.JvmGcMonitorService] [COLD1] [gc][1991143] overhead, spent [7.2s] collecting in the last [7.4s]
[2025-06-25T19:22:37,872][WARN ][o.e.t.ThreadPool         ] [COLD1] timer thread slept for [21.5s/21574ms] on absolute clock which is above the warn threshold of [5000ms]
[2025-06-25T19:22:37,873][WARN ][o.e.t.ThreadPool         ] [COLD1] timer thread slept for [21.5s/21574291266ns] on relative clock which is above the warn threshold of [5000ms]
[2025-06-25T19:22:38,172][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] could not acquire lock within 500ms when attempting to trigger G1GC due to high heap usage
[2025-06-25T19:22:38,173][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] could not acquire lock within 500ms when attempting to trigger G1GC due to high heap usage
[2025-06-25T19:22:38,173][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] memory usage not down after [0], before [33261247848], after [33261247848]
[2025-06-25T19:22:38,174][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] memory usage not down after [0], before [33261247848], after [33261247848]
[2025-06-25T19:22:38,178][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] could not acquire lock within 500ms when attempting to trigger G1GC due to high heap usage
[2025-06-25T19:22:38,179][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] memory usage not down after [0], before [33261247848], after [33261247848]
[2025-06-25T19:22:38,196][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] could not acquire lock within 500ms when attempting to trigger G1GC due to high heap usage
[2025-06-25T19:22:38,197][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] memory usage not down after [0], before [33261247848], after [33261247848]
[2025-06-25T19:22:41,056][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] could not acquire lock within 500ms when attempting to trigger G1GC due to high heap usage
[2025-06-25T19:22:41,057][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] memory usage not down after [0], before [33261247848], after [33261247848]
[2025-06-25T19:22:42,756][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] could not acquire lock within 500ms when attempting to trigger G1GC due to high heap usage
[2025-06-25T19:22:42,756][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] memory usage not down after [0], before [33261247848], after [33261247848]

Systemd:

Jun 25 19:22:16 cold1 systemd-entrypoint[29333]: java.lang.OutOfMemoryError: Java heap space
Jun 25 19:22:16 cold1 systemd-entrypoint[29333]: Dumping heap to /var/lib/elasticsearch/java_pid29333.hprof ...
Jun 25 19:22:55 cold1 systemd-entrypoint[29333]: Heap dump file created [28691299403 bytes in 39.631 secs]
Jun 25 19:22:55 cold1 systemd-entrypoint[29333]: Terminating due to java.lang.OutOfMemoryError: Java heap space
Jun 25 19:22:58 cold1 systemd[1]: elasticsearch.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Jun 25 19:22:58 cold1 systemd[1]: elasticsearch.service: Failed with result 'exit-code'.
Logs of the second node, 2025-06-25

Elasticsearch:

[2025-06-25T19:22:00,556][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD2] attempting to trigger G1GC due to high heap usage [31911675176]
[2025-06-25T19:22:00,564][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD2] GC did bring memory usage down, before [31911675176], after [31896038912], allocations [0], duration [8]
[2025-06-25T19:22:00,937][INFO ][o.e.m.j.JvmGcMonitorService] [COLD2] [gc][1991144] overhead, spent [366ms] collecting in the last [1.1s]
[2025-06-25T19:22:06,970][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD2] attempting to trigger G1GC due to high heap usage [31922278808]
[2025-06-25T19:22:06,983][WARN ][o.e.t.ThreadPool         ] [COLD2] timer thread slept for [6s/6034ms] on absolute clock which is above the warn threshold of [5000ms]
[2025-06-25T19:22:06,984][WARN ][o.e.t.ThreadPool         ] [COLD2] timer thread slept for [6s/6034459557ns] on relative clock which is above the warn threshold of [5000ms]
[2025-06-25T19:22:06,985][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD2] attempt to trigger young GC failed to bring memory down, triggering full GC
[2025-06-25T19:22:07,252][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD2] GC did not bring memory usage down, before [31922278808], after [31972651264], allocations [1], duration [281]
[2025-06-25T19:22:07,254][WARN ][o.e.m.j.JvmGcMonitorService] [COLD2] [gc][1991145] overhead, spent [5.9s] collecting in the last [6s]
[2025-06-25T19:22:09,506][INFO ][o.e.m.j.JvmGcMonitorService] [COLD2] [gc][1991147] overhead, spent [525ms] collecting in the last [1.2s]
[2025-06-25T19:22:17,303][WARN ][o.e.t.ThreadPool         ] [COLD2] timer thread slept for [7.3s/7380ms] on absolute clock which is above the warn threshold of [5000ms]
[2025-06-25T19:22:17,303][WARN ][o.e.t.ThreadPool         ] [COLD2] timer thread slept for [7.3s/7380310416ns] on relative clock which is above the warn threshold of [5000ms]
[2025-06-25T19:22:17,303][WARN ][o.e.m.j.JvmGcMonitorService] [COLD2] [gc][1991148] overhead, spent [7.7s] collecting in the last [7.7s]
[2025-06-25T19:22:17,303][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD2] attempting to trigger G1GC due to high heap usage [33242973680]
[2025-06-25T19:22:17,848][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD2] attempt to trigger young GC failed to bring memory down, triggering full GC
[2025-06-25T19:22:17,848][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD2] could not acquire lock within 500ms when attempting to trigger G1GC due to high heap usage
[2025-06-25T19:22:18,067][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD2] GC did not bring memory usage down, before [33242973680], after [33242997328], allocations [1], duration [764]
[2025-06-25T19:22:18,068][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD2] memory usage not down after [0], before [33242973680], after [33242997328]
[2025-06-25T19:22:17,848][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD2] could not acquire lock within 500ms when attempting to trigger G1GC due to high heap usage
[2025-06-25T19:22:18,069][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD2] memory usage not down after [0], before [33242973680], after [33242997328]
[2025-06-25T19:22:18,071][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD2] memory usage not down after [223], before [33242997328], after [33242997328]

Systemd:

Jun 25 19:22:26 cold2 systemd-entrypoint[30961]: java.lang.OutOfMemoryError: Java heap space
Jun 25 19:22:26 cold2 systemd-entrypoint[30961]: Dumping heap to /var/lib/elasticsearch/java_pid30961.hprof ...
Jun 25 19:23:13 cold2 systemd-entrypoint[30961]: Heap dump file created [28882639542 bytes in 46.862 secs]
Jun 25 19:23:13 cold2 systemd-entrypoint[30961]: Terminating due to java.lang.OutOfMemoryError: Java heap space
Jun 25 19:23:16 cold2 systemd[1]: elasticsearch.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Jun 25 19:23:16 cold2 systemd[1]: elasticsearch.service: Failed with result 'exit-code'.

Update: I changed the role of these nodes to data_cold only and restarted Elasticsearch.
It did not help; the nodes crashed in the same way again.

Update:

heap_used_percent for the cold nodes, sampled once per minute:
COLD1 heap_used_percent,COLD2 heap_used_percent
63,67
63,67
63,68
64,68
64,68
64,68
64,69
65,69
65,69
65,70
66,70
66,70
66,70
66,71
67,71
67,71
67,71
67,72
68,72
68,72
68,73
69,73
69,73
69,73
69,74
70,74
70,74
70,74
70,75
71,75
71,75
71,75
71,76
72,76
72,76
72,77
73,77
73,77
73,77
73,78
74,78
74,78
74,78
74,79
75,79
75,79
75,22
75,23
76,23
76,23
76,23
77,24
77,24
77,24
77,24
78,25
78,25
78,25
78,26
79,26
22,26
23,26
23,27
23,27
23,27
24,27
24,28
24,28
24,28
25,28
25,29
25,29
25,29
26,30
26,30
26,30
27,30
27,31
27,31
27,31
28,32
28,32
28,32
29,32
29,33
29,33
29,33
30,33
30,34
30,34
30,34
31,34
31,35
31,35
31,35
32,36
32,36
32,36
33,36
33,37
33,37
33,37
34,37
34,38
34,38
35,38
35,38
35,39
35,39
36,39
36,40
36,40
36,40
37,40
37,41
37,41
37,41
38,42
38,42
38,42
38,42
39,43
39,43
39,43
40,43
40,44
40,44
40,44
41,44
41,45
42,45
42,46
42,46
42,46
43,46
43,47
43,47
43,47
44,47
44,48
44,48
45,48
45,49
45,49
45,49
46,50
46,50
46,50
47,50
47,51
47,51
47,51
48,51
48,52
48,52
49,52
49,52
49,53
49,53
50,53
50,54
50,54
50,54
51,54
51,55
51,55
51,55
52,55
52,56
52,56
53,56
53,57
53,57
53,57
54,57
54,58
54,58
54,58
55,58
55,59
55,59
56,59
56,60
56,60
56,60
57,60
57,61
57,61
57,61
58,61
58,62
58,62
58,62
59,63
59,63
59,63
60,63
60,64
60,64
60,64
61,64
61,65
61,65
62,65
62,65
62,66
62,66
63,66
63,67
63,67
63,67
64,67
64,68
64,68
64,68
65,68
65,69
65,69
66,69
66,70
66,70
66,70
67,70
67,71
67,71
67,71
68,71
68,72
68,72
68,72
69,72
69,73
69,73
70,73
70,74
70,74
70,74
71,74
71,75
71,75
72,75
72,75
72,76
72,76
73,76
73,76
73,77
73,77
74,77
74,78
74,78
74,78
75,78
75,79
75,79
75,79
76,22
76,22
76,23
77,23
77,23
77,24
77,24
78,24
78,24
78,25
79,25
79,25
22,25
22,26
23,26
23,26
23,27
24,27
24,27
24,27
24,28
25,28
25,28
25,28
25,29
26,29
26,29
26,29
27,30
27,30
27,30
27,31
28,31
28,31
28,31
28,32
29,32
29,32
29,33
30,33
30,33
30,33
30,34
31,34
31,34
31,34
31,35
32,35
32,35
32,35
33,36
33,36
33,36
33,37
34,37
34,37
34,37
34,38
35,38
35,38
35,38
35,39
36,39
36,39
36,40
37,40
37,40
37,40
37,41
38,41
38,41
38,41
38,42
39,42
39,42
39,43
40,43
40,43
40,43
40,44
41,44
41,44
41,44
41,45
42,45
42,45
42,45
43,46
43,46
43,46
43,46
44,47
44,47
44,47
44,48
45,48
45,48
45,48
45,49
46,49
46,49
46,49
47,50
47,50
47,50
47,50
48,51
48,51
48,51
48,52
49,52
49,52
49,52
50,53
50,53
50,53
50,53
51,54
51,54
51,54
51,54
52,55
52,55
52,55
52,56
53,56
53,56
53,56
54,57
54,57
54,57
54,57
55,58
55,58
55,58
55,59
56,59
56,59
56,59
56,60
57,60
57,60
57,60
58,61
58,61
58,61
58,62
59,62
59,62
59,62
59,63
60,63
60,63
60,63
60,64
61,64
61,64
61,64
62,65
62,65
62,65
62,66
63,66
63,66
63,66
64,67
64,67
64,67
64,67
65,68
65,68
65,68
65,69
66,69
66,69
66,69
66,70
67,70
67,70
67,70
68,71
68,71
68,71
68,71
69,72
69,72
69,72
69,73
70,73
70,73
70,73
71,74
71,74
71,74
71,74
72,75
72,75
72,75
72,76
73,76
73,76
73,76
73,77
74,77
74,77
74,77
75,78
75,78
75,78
75,78
76,79
76,79
76,79
76,22
77,23
77,23
77,23
78,23
78,24
78,24
78,24
79,24
79,25
22,25
23,25
23,26
23,26
23,26
24,26
24,27
24,27
25,27
25,27
25,28
25,28
26,28
26,28
26,29
26,29
26,29
27,30
27,30
27,30
28,30
28,31
28,31
29,31
29,31
29,32
29,32
30,32
30,33
30,33
31,33
31,33
31,34
31,34
32,34
32,34
32,35
32,35
33,35
33,35
33,36
33,36
34,36
34,37
34,37
35,37
35,37
35,38
35,38
36,38
36,38
36,39
36,39
37,39
37,40
37,40
38,40
38,40
38,41
38,41
39,41
39,41
39,42
39,42
40,42
40,42
40,43
40,43
41,43
41,43
41,44
42,44
42,44
42,45
42,45
43,45
43,45
43,46
43,46
44,46
44,46
45,47
45,47
45,48
45,48
46,48
46,48
46,49
47,49
47,49
47,50
47,50
48,50
48,50
48,51
48,51
49,51
49,52
49,52
50,52
50,52
50,53
50,53
51,53
51,53
51,54
51,54
52,54
52,54
52,55
53,55
53,55
53,56
53,56
54,56
54,56
54,57
54,57
55,57
55,57
55,58
56,58
56,58
56,59
56,59
57,59
57,59
57,60
57,60
58,60
58,60
58,61
59,61
59,61
59,61
59,62
60,62
60,62
60,63
60,63
61,63
61,63
61,64
62,64
62,64
62,65
62,65
63,65
63,65
63,66
63,66
64,66
64,66
64,67
65,67
65,67
65,68
65,68
66,68
66,68
66,69
66,69
67,69
67,69
67,70
67,70
67,70
68,70
68,71
69,71
69,71
69,72
69,72
70,72
70,72
70,73
71,73
71,73
71,73
71,74
72,74
72,74
72,75
72,75
73,75
73,75
73,76
73,76
74,76
74,77
74,77
75,77
75,77
75,78
75,78
76,78
76,78
76,79
76,79
77,79
77,22
77,23
78,23
78,23
78,23
78,24
79,24
22,24
22,24
23,25
23,25
23,25
23,25
24,26
24,26
24,26
24,27
25,27
25,27
25,27
26,28
26,28
26,28
26,28
27,29
27,29
27,29
27,30
28,30
28,30
28,30
29,31
29,31
29,31
29,31
30,32
30,32
30,32
30,32
31,33
31,33
31,33
31,34
38,40
39,41
39,42
40,42
40,42
40,42
41,43
41,43
41,43
41,44
42,44
42,44
42,44
42,45
43,45
43,45
43,45
44,46
44,46
44,46
44,47
45,47
45,47
45,47
45,48
46,48
46,48
46,48
47,49
47,49
47,49
47,50
48,50
48,50
48,50
48,51
49,51
49,51
49,51
50,52
50,52
50,52
50,52
51,53
51,53
51,53
51,54
52,54
52,54
52,54
53,55
53,55
53,55
53,55
54,56
54,56
54,56
54,57
55,57
55,57
55,57
56,58
56,58
56,58
56,58
57,59
57,59
57,59
58,60
58,60
58,60
58,61
59,61
59,61
59,61
59,62
60,62
60,62
60,62
61,63
61,63
61,63
61,64
62,64
62,64
62,64
62,65
63,65
63,65
63,65
64,66
64,66
64,66
64,66
65,67
65,67
65,67
65,68
66,68
66,68
66,68
67,69
67,69
67,69
67,69
68,70
68,70
68,70
68,71
69,71
69,71
69,71
70,72
70,72
70,72
70,72
71,73
71,73
71,73
71,73
72,74
72,74
72,74
72,75
73,75
73,75
73,75
74,76
74,76
74,76
74,76
75,77
75,77
75,77
75,78
76,78
76,78
76,78
77,79
77,79
77,79
77,22
78,23
78,23
78,23
78,23
79,24
22,24
23,24
23,24
23,25
23,25
24,25
24,26
24,26
25,26
25,27
25,27
26,27
26,27
26,28
27,28
27,28
27,29
27,29
28,29
28,29
28,30
28,30
29,30
29,31
29,31
29,31
30,31
30,32
30,32
31,32
31,32
31,33
31,33
32,33
32,34
32,34
33,34
33,34
33,35
33,35
34,35
34,35
34,36
34,36
35,36
35,37
35,37
36,37
36,37
36,38
36,38
37,38
37,38
37,39
37,39
38,39
38,40
38,40
39,40
39,40
39,41
39,41
40,41
40,41
40,42
40,42
41,42
41,43
41,43
42,43
42,43
42,44
42,44
43,44
43,44
43,45
43,45
44,45
44,46
44,46
45,46
45,46
45,47
45,47
46,47
46,48
46,48
46,48
47,48
47,49
47,49
48,49
48,49
48,50
48,50
49,50
49,50
49,51
49,51
50,51
50,52
50,52
51,52
51,52
51,53
51,53
52,53
52,53
52,54
52,54
53,54
53,54
53,55
54,55
54,55
54,56
54,56
55,56
55,56
55,57
55,57
56,57
56,57
56,58
57,58
57,58
57,59
57,59
58,59
58,59
58,60
58,60
59,60
59,60
59,61
60,61
60,61
60,61
60,62
61,62
61,62
61,63
61,63
62,63
62,63
62,64
62,64
63,64
63,65
63,65
64,65
64,65
64,66
64,66
65,66
65,66
65,67
65,67
66,67
66,67
66,68
67,68
67,68
67,69
68,69
68,69
68,70
68,70
69,70
69,71
69,71
70,71
70,71
70,72
70,72
71,72
71,72
71,73
72,73
72,73
72,74
72,74
73,74
73,74
73,75
73,75
74,75
74,75
74,76
75,76
75,76
75,76
75,77
76,77
76,77
76,78
76,78
77,78
77,78
77,79
78,79
78,79
78,22
78,23
79,23
79,23
22,23
23,24
23,24
23,24
24,25
24,25
24,25
24,25
25,26
25,26
25,26
25,26
26,27
26,27
26,27
27,27
27,28
27,28
27,28
28,29
28,29
28,29
28,29
29,30
29,30
29,30
29,30
30,31
30,31
30,31
31,32
31,32
31,32
31,32
32,33
32,33
32,33
32,33
33,34
33,34
33,34
34,34
34,35
34,35
34,35
35,36
35,36
35,36
35,36
36,37
36,37
36,37
37,37
37,38
37,38
37,38
38,39
38,39
38,39
38,39
39,40
39,40
39,40
40,40
40,41
40,41
40,41
41,42
41,42
41,42
41,42
42,43
42,43
42,43
43,43
43,44
43,44
43,44
44,45
44,45
44,45
44,45
45,46
45,46
45,46
45,46
46,47
46,47
46,47
47,47
47,48
47,48
47,48
48,49
48,49
48,49
48,49
49,50
49,50
49,50
50,50
50,51
50,51
50,51
51,51
51,52
51,52
51,52
52,53
52,53
52,53
53,53
53,54
53,54
53,54
54,54
54,55
54,55
55,56
55,56
55,56
56,56
56,57
56,57
56,57
57,58
57,58
57,58
58,58
58,59
58,59
58,59
59,59
59,60
59,60
59,60
60,60
60,61
60,61
60,61
61,62
61,62
61,62
62,62
62,63
62,63
62,63
63,64
63,64
63,64
63,64
64,65
64,65
64,65
65,65
65,66
65,66
65,66
66,67
66,67
66,67
67,67
67,68
67,68
67,68
68,69
68,69
68,69
69,69
69,70
69,70
69,70
70,71
70,71
70,71
70,71
71,72
71,72
71,72
72,72
72,73
72,73
72,73
73,74
73,74
73,74
73,74
74,75
74,75
74,75
75,75
75,76
75,76
75,76
76,76
76,77
76,77
76,77
77,78
77,78
77,78
78,78
78,79
78,79
78,79
22,23
23,24
23,24
23,24
24,25
24,25
25,25
25,26
26,27
26,27
27,27
27,28
27,28
28,29
28,29
28,29
29,29
29,30
29,30
29,30
30,31
30,31
30,31
30,31
31,32
31,32
31,32
31,32
32,33
32,33
32,33
94,24

The data stops here because both nodes crashed.

There is nothing (and no crash) in the logs of the three non-voting_only master nodes except the generic

transport connection to [{COLDx}{id removed}{id removed}{ip removed}{ip removed:9300}{cmv}] closed by remote

and one complaint on the active master node:

[master1] health check of [/var/lib/elasticsearch/nodes/0] took [17210ms] which is above the warn threshold of [5s]

That was a very unfortunate failure during a VM migration, but it is no reason for the voting_only nodes to fill their heap and crash.

What is the topology of your cluster? Why would you have more than one voting-only master node?

Sorry, I never described this before. What do you mean by topology?

3 dedicated master nodes plus 2 voting_only nodes shared with the cold tier gives a quorum of 3 and allows losing two master-eligible nodes. The 3 master nodes and 1 voting_only node you suggest also give a quorum of 3, but allow losing only one node.
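Roughly, the arithmetic I am assuming here (treating every master-eligible node as part of the voting configuration):

quorum = floor(n / 2) + 1

With n = 5 master-eligible nodes the quorum is 3, so two of them can fail; with n = 4 the quorum is still 3, so only one can fail.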

If you have 3 dedicated master nodes I do not see the point in having ANY voting-only master nodes, so I would recommend removing them.

Then I lose redundancy and cannot keep working with only one surviving master node. Why do you suggest this?

Update: I tried removing the roles anyway and restarted, but it did not help: the nodes crashed again at exactly the same time.

Maybe because this is a direct cut and paste from the official documentation:

it is good practice to limit the number of master-eligible nodes in the cluster to three.

?

IIRC in a previous thread you had various dc/rack awareness settings, not mentioned here. Anyway, should that be the scenario here, then this section might also apply:

You can solve this by placing one master-eligible node in each of your two zones and adding a single extra master-eligible node in an independent third zone. The extra master-eligible node acts as a tiebreaker in cases where the two original zones are disconnected from each other. The extra tiebreaker node should be a dedicated voting-only master-eligible node, also known as a dedicated tiebreaker.

To my understanding, you are not aligned with this recommendation either? Please correct me if that’s inaccurate.

But:

What’s the current output of GET /_cat/nodes?v?

It seems the current crashes may be unrelated to the master node count. So you probably need to understand, urgently, what specifically is causing all the heap usage on the cold nodes.
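If you can catch it before the heap fills completely, the per-node circuit breaker stats might at least hint at which category of memory is growing (parent, request, fielddata, in-flight requests, …); the node names here are just the ones from your logs:

GET _nodes/COLD1,COLD2/stats/breaker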

How many nodes do you have in total, and what are their roles and specs? Your description is a little confusing: you mention having 3 dedicated master nodes and 2 voting-only nodes that also have the data_cold role, but you also need data_hot/data_content/ingest nodes somewhere.

If those roles are on the dedicated master nodes, then they are not dedicated master nodes, they are data-and-master nodes.

Share the response of GET /_cat/nodes?v from Kibana Dev Tools.

Sorry, and thank you for spotting my mistakes.
Yes, the dedicated masters are not really dedicated because they store the Kibana and security indices.
Yes, hot/warm nodes exist so the tiers work. I forgot about them because they kept working without problems.

Thank you both for GET /_cat/nodes?v. Very useful for checking roles!

heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
          27          98  13    6.75    8.15     8.15 hw        -      DATA1
          49          98  13    6.90    8.00     7.80 hw        -      DATA2
          42          99   0    0.03    0.08     0.15 ms        -      master1
          63          95   1    0.04    0.09     0.05 ms        *      master2
          63          98   1    0.13    0.06     0.03 ms        -      master3
          15          72   2    0.59    0.84     1.09 c         -      COLD1
           9          87   2    0.17    0.17     0.13 c         -      COLD2

Yes, I agree a warm tier like this is a bit useless.

I agree it goes against the recommendations for best performance. But is trading performance for more redundancy wrong? Slower is okay here, but less redundancy is a problem.

Update: Yes, I agree the redundancy setup of this cluster is super dumb, but that is no reason for the heap to suddenly fill up and crash both cold nodes.

You forgot the specs.

What are the specs of each node? How many CPUs? How much memory? What heap size is configured? What is the disk type, HDD or SSD?

Also, your hot nodes are also warm nodes; this does not make much sense, as the goal of data tiering is to have different hardware for different tiers. Having the hot and warm roles on the same node does not help with anything.

I'm not sure and may be wrong, but I think this could also lead to unnecessary shard movement if you are using ILM and moving data from hot to warm.

You already have 3 master-eligible nodes, which is enough redundancy for a small cluster; I've run one with 25 nodes and just 3 master-eligible nodes without any issues.

I don't think having 2 voting-only nodes helps with redundancy; it depends on a lot of things.

Also, your issue may not even be related to this, since you said it keeps happening after changing the nodes to data_cold only.

You would need to provide more context, like the specs of your nodes and the amount of data you have.

Run this in Kibana Dev Tools and share the result; it will show the disk usage of your nodes:

GET /_cat/nodes?v&h=name,role,disk.used_percent,disk.used,disk.avail&s=role

Do you have any ILM policies configured?
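For example (the index pattern below is only a placeholder), these would list the policies and show where your indices currently sit in their lifecycle:

GET _ilm/policy
GET my-index-*/_ilm/explain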

GET _nodes/stats/jvm

Might help, but it will likely just tell you the same story: something is using a lot of heap. Do you have a heap dump from a previous crash? If so, analyse that.

Analysing JVM heap usage on a running system isn’t trivial and has its dangers, but there are tools: jmap, VisualVM…

Good ideas.
Looking at /_nodes/stats/jvm is how I know the cold nodes are very idle and GC happens roughly every three hours. I monitor heap_used_percent with a script, so I can see this crystal clear. Maybe the cold nodes simply have too much RAM? Judging from this behaviour, the heap suddenly fills and GC does not bring memory down. Do you want to graph the values yourself, or do you prefer a picture?
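Something along these lines is enough to collect those per-minute heap_used_percent values (filter_path only trims the response to the interesting field; the node names are the ones above):

GET _nodes/COLD1,COLD2/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent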
Update: Put the data as CSV into Heap suddenly full on all voting-only master nodes (7.17.28).

Yes, systemd said Heap dump file created. Looks like a good clue, but I know nothing about heap dumps. Where do I start?

I prefer pictures.

Find the heap dump! Expect it to be a large file.
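If it is not obvious where the dump landed, the JVM arguments each node started with (including any -XX:HeapDumpPath=... override; the packaged defaults typically point at the data directory) are visible via the node info API:

GET _nodes/COLD1,COLD2/jvm?filter_path=**.input_arguments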


[heap_used_percent graph for both cold nodes]
Very boring, except that both nodes crash right after this.

Easy, systemd gives a good hint: Dumping heap to /var/lib/elasticsearch/java_pid398746.hprof, and yes, that's big! Good thing you mentioned it, because there is one file like this for each restart.

The Eclipse Memory Analyzer is a fast and feature-rich Java heap analyzer that helps you find memory leaks and reduce memory consumption.

The JDK bundled with Elasticsearch probably includes jmap. There’s VisualVM. You have various alternatives. Good luck.

You started out so easy. I didn't expect the difficulty to climb this high this fast.

Wildly guessing, your data_cold nodes might be getting searches that are expensive for them. You didn’t provide any specs.

Also a good idea. I can see it being somebody running dumb searches in Kibana. But you can't know who searches what and when without a subscription, right?

Meanwhile, I discovered that somebody had enabled swap on these nodes. Fixed.
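A quick way to confirm the heap is locked in memory on every node (assuming bootstrap.memory_lock is what I should rely on instead of swap) seems to be:

GET _nodes?filter_path=**.mlockall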

lol. That happens quite often on here

For finding dumb queries you can look at the slow log?
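Something like this on the cold indices would turn it on; the index name and the thresholds are only examples:

PUT /my-cold-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}

Queries above the thresholds then show up in the index search slow log on whichever node executed them.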