Heap suddenly full on all voting-only master nodes (7.17.28)

I have no idea what happened, and there is no clue in the node logs.
These nodes have the roles data_cold,master,voting_only and they are normally very idle, with GC happening only every three hours or so.

Suddenly the heap on both nodes filled up and they crashed.

Logs of the first node, 2025-06-25

Elasticsearch:

[2025-06-25T19:21:59,987][WARN ][o.e.m.j.JvmGcMonitorService] [COLD1] [gc][1991139] overhead, spent [5.2s] collecting in the last [5.4s]
[2025-06-25T19:22:00,556][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] attempting to trigger G1GC due to high heap usage [31975780480]
[2025-06-25T19:22:00,679][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] attempt to trigger young GC failed to bring memory down, triggering full GC
[2025-06-25T19:22:00,936][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] GC did not bring memory usage down, before [31975780480], after [32002497128], allocations [68], duration [380]
[2025-06-25T19:22:01,013][INFO ][o.e.m.j.JvmGcMonitorService] [COLD1] [gc][1991140] overhead, spent [271ms] collecting in the last [1s]
[2025-06-25T19:22:08,809][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] attempting to trigger G1GC due to high heap usage [33109914656]
[2025-06-25T19:22:08,809][WARN ][o.e.t.ThreadPool         ] [COLD1] timer thread slept for [6.6s/6672ms] on absolute clock which is above the warn threshold of [5000ms]
[2025-06-25T19:22:08,820][WARN ][o.e.t.ThreadPool         ] [COLD1] timer thread slept for [6.6s/6672162398ns] on relative clock which is above the warn threshold of [5000ms]
[2025-06-25T19:22:08,821][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] attempt to trigger young GC failed to bring memory down, triggering full GC
[2025-06-25T19:22:09,085][WARN ][o.e.m.j.JvmGcMonitorService] [COLD1] [gc][1991142] overhead, spent [6.5s] collecting in the last [6.7s]
[2025-06-25T19:22:09,086][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] GC did bring memory usage down, before [33109914656], after [33106726224], allocations [1], duration [276]
[2025-06-25T19:22:16,298][WARN ][o.e.t.ThreadPool         ] [COLD1] timer thread slept for [6.7s/6738ms] on absolute clock which is above the warn threshold of [5000ms]
[2025-06-25T19:22:16,299][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] attempting to trigger G1GC due to high heap usage [33261247848]
[2025-06-25T19:22:16,300][WARN ][o.e.t.ThreadPool         ] [COLD1] timer thread slept for [6.7s/6738267801ns] on relative clock which is above the warn threshold of [5000ms]
[2025-06-25T19:22:37,672][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] could not acquire lock within 500ms when attempting to trigger G1GC due to high heap usage
[2025-06-25T19:22:37,674][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] memory usage not down after [0], before [33261247848], after [33261247848]
[2025-06-25T19:22:37,675][WARN ][o.e.m.j.JvmGcMonitorService] [COLD1] [gc][1991143] overhead, spent [7.2s] collecting in the last [7.4s]
[2025-06-25T19:22:37,872][WARN ][o.e.t.ThreadPool         ] [COLD1] timer thread slept for [21.5s/21574ms] on absolute clock which is above the warn threshold of [5000ms]
[2025-06-25T19:22:37,873][WARN ][o.e.t.ThreadPool         ] [COLD1] timer thread slept for [21.5s/21574291266ns] on relative clock which is above the warn threshold of [5000ms]
[2025-06-25T19:22:38,172][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] could not acquire lock within 500ms when attempting to trigger G1GC due to high heap usage
[2025-06-25T19:22:38,173][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] could not acquire lock within 500ms when attempting to trigger G1GC due to high heap usage
[2025-06-25T19:22:38,173][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] memory usage not down after [0], before [33261247848], after [33261247848]
[2025-06-25T19:22:38,174][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] memory usage not down after [0], before [33261247848], after [33261247848]
[2025-06-25T19:22:38,178][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] could not acquire lock within 500ms when attempting to trigger G1GC due to high heap usage
[2025-06-25T19:22:38,179][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] memory usage not down after [0], before [33261247848], after [33261247848]
[2025-06-25T19:22:38,196][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] could not acquire lock within 500ms when attempting to trigger G1GC due to high heap usage
[2025-06-25T19:22:38,197][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] memory usage not down after [0], before [33261247848], after [33261247848]
[2025-06-25T19:22:41,056][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] could not acquire lock within 500ms when attempting to trigger G1GC due to high heap usage
[2025-06-25T19:22:41,057][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] memory usage not down after [0], before [33261247848], after [33261247848]
[2025-06-25T19:22:42,756][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] could not acquire lock within 500ms when attempting to trigger G1GC due to high heap usage
[2025-06-25T19:22:42,756][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD1] memory usage not down after [0], before [33261247848], after [33261247848]

Systemd:

Jun 25 19:22:16 cold1 systemd-entrypoint[29333]: java.lang.OutOfMemoryError: Java heap space
Jun 25 19:22:16 cold1 systemd-entrypoint[29333]: Dumping heap to /var/lib/elasticsearch/java_pid29333.hprof ...
Jun 25 19:22:55 cold1 systemd-entrypoint[29333]: Heap dump file created [28691299403 bytes in 39.631 secs]
Jun 25 19:22:55 cold1 systemd-entrypoint[29333]: Terminating due to java.lang.OutOfMemoryError: Java heap space
Jun 25 19:22:58 cold1 systemd[1]: elasticsearch.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Jun 25 19:22:58 cold1 systemd[1]: elasticsearch.service: Failed with result 'exit-code'.
Logs of the second node, 2025-06-25

Elasticsearch:

[2025-06-25T19:22:00,556][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD2] attempting to trigger G1GC due to high heap usage [31911675176]
[2025-06-25T19:22:00,564][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD2] GC did bring memory usage down, before [31911675176], after [31896038912], allocations [0], duration [8]
[2025-06-25T19:22:00,937][INFO ][o.e.m.j.JvmGcMonitorService] [COLD2] [gc][1991144] overhead, spent [366ms] collecting in the last [1.1s]
[2025-06-25T19:22:06,970][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD2] attempting to trigger G1GC due to high heap usage [31922278808]
[2025-06-25T19:22:06,983][WARN ][o.e.t.ThreadPool         ] [COLD2] timer thread slept for [6s/6034ms] on absolute clock which is above the warn threshold of [5000ms]
[2025-06-25T19:22:06,984][WARN ][o.e.t.ThreadPool         ] [COLD2] timer thread slept for [6s/6034459557ns] on relative clock which is above the warn threshold of [5000ms]
[2025-06-25T19:22:06,985][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD2] attempt to trigger young GC failed to bring memory down, triggering full GC
[2025-06-25T19:22:07,252][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD2] GC did not bring memory usage down, before [31922278808], after [31972651264], allocations [1], duration [281]
[2025-06-25T19:22:07,254][WARN ][o.e.m.j.JvmGcMonitorService] [COLD2] [gc][1991145] overhead, spent [5.9s] collecting in the last [6s]
[2025-06-25T19:22:09,506][INFO ][o.e.m.j.JvmGcMonitorService] [COLD2] [gc][1991147] overhead, spent [525ms] collecting in the last [1.2s]
[2025-06-25T19:22:17,303][WARN ][o.e.t.ThreadPool         ] [COLD2] timer thread slept for [7.3s/7380ms] on absolute clock which is above the warn threshold of [5000ms]
[2025-06-25T19:22:17,303][WARN ][o.e.t.ThreadPool         ] [COLD2] timer thread slept for [7.3s/7380310416ns] on relative clock which is above the warn threshold of [5000ms]
[2025-06-25T19:22:17,303][WARN ][o.e.m.j.JvmGcMonitorService] [COLD2] [gc][1991148] overhead, spent [7.7s] collecting in the last [7.7s]
[2025-06-25T19:22:17,303][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD2] attempting to trigger G1GC due to high heap usage [33242973680]
[2025-06-25T19:22:17,848][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD2] attempt to trigger young GC failed to bring memory down, triggering full GC
[2025-06-25T19:22:17,848][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD2] could not acquire lock within 500ms when attempting to trigger G1GC due to high heap usage
[2025-06-25T19:22:18,067][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD2] GC did not bring memory usage down, before [33242973680], after [33242997328], allocations [1], duration [764]
[2025-06-25T19:22:18,068][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD2] memory usage not down after [0], before [33242973680], after [33242997328]
[2025-06-25T19:22:17,848][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD2] could not acquire lock within 500ms when attempting to trigger G1GC due to high heap usage
[2025-06-25T19:22:18,069][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD2] memory usage not down after [0], before [33242973680], after [33242997328]
[2025-06-25T19:22:18,071][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [COLD2] memory usage not down after [223], before [33242997328], after [33242997328]

Systemd:

Jun 25 19:22:26 cold2 systemd-entrypoint[30961]: java.lang.OutOfMemoryError: Java heap space
Jun 25 19:22:26 cold2 systemd-entrypoint[30961]: Dumping heap to /var/lib/elasticsearch/java_pid30961.hprof ...
Jun 25 19:23:13 cold2 systemd-entrypoint[30961]: Heap dump file created [28882639542 bytes in 46.862 secs]
Jun 25 19:23:13 cold2 systemd-entrypoint[30961]: Terminating due to java.lang.OutOfMemoryError: Java heap space
Jun 25 19:23:16 cold2 systemd[1]: elasticsearch.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Jun 25 19:23:16 cold2 systemd[1]: elasticsearch.service: Failed with result 'exit-code'.

Update: I changed the role of these nodes to data_cold only and restarted Elasticsearch.
It did not help; the nodes crashed in the same way again.

Update:

heap_used_percent for the cold nodes, sampled once per minute:
COLD1 heap_used_percent,COLD2 heap_used_percent
63,67
63,67
63,68
64,68
64,68
64,68
64,69
65,69
65,69
65,70
66,70
66,70
66,70
66,71
67,71
67,71
67,71
67,72
68,72
68,72
68,73
69,73
69,73
69,73
69,74
70,74
70,74
70,74
70,75
71,75
71,75
71,75
71,76
72,76
72,76
72,77
73,77
73,77
73,77
73,78
74,78
74,78
74,78
74,79
75,79
75,79
75,22
75,23
76,23
76,23
76,23
77,24
77,24
77,24
77,24
78,25
78,25
78,25
78,26
79,26
22,26
23,26
23,27
23,27
23,27
24,27
24,28
24,28
24,28
25,28
25,29
25,29
25,29
26,30
26,30
26,30
27,30
27,31
27,31
27,31
28,32
28,32
28,32
29,32
29,33
29,33
29,33
30,33
30,34
30,34
30,34
31,34
31,35
31,35
31,35
32,36
32,36
32,36
33,36
33,37
33,37
33,37
34,37
34,38
34,38
35,38
35,38
35,39
35,39
36,39
36,40
36,40
36,40
37,40
37,41
37,41
37,41
38,42
38,42
38,42
38,42
39,43
39,43
39,43
40,43
40,44
40,44
40,44
41,44
41,45
42,45
42,46
42,46
42,46
43,46
43,47
43,47
43,47
44,47
44,48
44,48
45,48
45,49
45,49
45,49
46,50
46,50
46,50
47,50
47,51
47,51
47,51
48,51
48,52
48,52
49,52
49,52
49,53
49,53
50,53
50,54
50,54
50,54
51,54
51,55
51,55
51,55
52,55
52,56
52,56
53,56
53,57
53,57
53,57
54,57
54,58
54,58
54,58
55,58
55,59
55,59
56,59
56,60
56,60
56,60
57,60
57,61
57,61
57,61
58,61
58,62
58,62
58,62
59,63
59,63
59,63
60,63
60,64
60,64
60,64
61,64
61,65
61,65
62,65
62,65
62,66
62,66
63,66
63,67
63,67
63,67
64,67
64,68
64,68
64,68
65,68
65,69
65,69
66,69
66,70
66,70
66,70
67,70
67,71
67,71
67,71
68,71
68,72
68,72
68,72
69,72
69,73
69,73
70,73
70,74
70,74
70,74
71,74
71,75
71,75
72,75
72,75
72,76
72,76
73,76
73,76
73,77
73,77
74,77
74,78
74,78
74,78
75,78
75,79
75,79
75,79
76,22
76,22
76,23
77,23
77,23
77,24
77,24
78,24
78,24
78,25
79,25
79,25
22,25
22,26
23,26
23,26
23,27
24,27
24,27
24,27
24,28
25,28
25,28
25,28
25,29
26,29
26,29
26,29
27,30
27,30
27,30
27,31
28,31
28,31
28,31
28,32
29,32
29,32
29,33
30,33
30,33
30,33
30,34
31,34
31,34
31,34
31,35
32,35
32,35
32,35
33,36
33,36
33,36
33,37
34,37
34,37
34,37
34,38
35,38
35,38
35,38
35,39
36,39
36,39
36,40
37,40
37,40
37,40
37,41
38,41
38,41
38,41
38,42
39,42
39,42
39,43
40,43
40,43
40,43
40,44
41,44
41,44
41,44
41,45
42,45
42,45
42,45
43,46
43,46
43,46
43,46
44,47
44,47
44,47
44,48
45,48
45,48
45,48
45,49
46,49
46,49
46,49
47,50
47,50
47,50
47,50
48,51
48,51
48,51
48,52
49,52
49,52
49,52
50,53
50,53
50,53
50,53
51,54
51,54
51,54
51,54
52,55
52,55
52,55
52,56
53,56
53,56
53,56
54,57
54,57
54,57
54,57
55,58
55,58
55,58
55,59
56,59
56,59
56,59
56,60
57,60
57,60
57,60
58,61
58,61
58,61
58,62
59,62
59,62
59,62
59,63
60,63
60,63
60,63
60,64
61,64
61,64
61,64
62,65
62,65
62,65
62,66
63,66
63,66
63,66
64,67
64,67
64,67
64,67
65,68
65,68
65,68
65,69
66,69
66,69
66,69
66,70
67,70
67,70
67,70
68,71
68,71
68,71
68,71
69,72
69,72
69,72
69,73
70,73
70,73
70,73
71,74
71,74
71,74
71,74
72,75
72,75
72,75
72,76
73,76
73,76
73,76
73,77
74,77
74,77
74,77
75,78
75,78
75,78
75,78
76,79
76,79
76,79
76,22
77,23
77,23
77,23
78,23
78,24
78,24
78,24
79,24
79,25
22,25
23,25
23,26
23,26
23,26
24,26
24,27
24,27
25,27
25,27
25,28
25,28
26,28
26,28
26,29
26,29
26,29
27,30
27,30
27,30
28,30
28,31
28,31
29,31
29,31
29,32
29,32
30,32
30,33
30,33
31,33
31,33
31,34
31,34
32,34
32,34
32,35
32,35
33,35
33,35
33,36
33,36
34,36
34,37
34,37
35,37
35,37
35,38
35,38
36,38
36,38
36,39
36,39
37,39
37,40
37,40
38,40
38,40
38,41
38,41
39,41
39,41
39,42
39,42
40,42
40,42
40,43
40,43
41,43
41,43
41,44
42,44
42,44
42,45
42,45
43,45
43,45
43,46
43,46
44,46
44,46
45,47
45,47
45,48
45,48
46,48
46,48
46,49
47,49
47,49
47,50
47,50
48,50
48,50
48,51
48,51
49,51
49,52
49,52
50,52
50,52
50,53
50,53
51,53
51,53
51,54
51,54
52,54
52,54
52,55
53,55
53,55
53,56
53,56
54,56
54,56
54,57
54,57
55,57
55,57
55,58
56,58
56,58
56,59
56,59
57,59
57,59
57,60
57,60
58,60
58,60
58,61
59,61
59,61
59,61
59,62
60,62
60,62
60,63
60,63
61,63
61,63
61,64
62,64
62,64
62,65
62,65
63,65
63,65
63,66
63,66
64,66
64,66
64,67
65,67
65,67
65,68
65,68
66,68
66,68
66,69
66,69
67,69
67,69
67,70
67,70
67,70
68,70
68,71
69,71
69,71
69,72
69,72
70,72
70,72
70,73
71,73
71,73
71,73
71,74
72,74
72,74
72,75
72,75
73,75
73,75
73,76
73,76
74,76
74,77
74,77
75,77
75,77
75,78
75,78
76,78
76,78
76,79
76,79
77,79
77,22
77,23
78,23
78,23
78,23
78,24
79,24
22,24
22,24
23,25
23,25
23,25
23,25
24,26
24,26
24,26
24,27
25,27
25,27
25,27
26,28
26,28
26,28
26,28
27,29
27,29
27,29
27,30
28,30
28,30
28,30
29,31
29,31
29,31
29,31
30,32
30,32
30,32
30,32
31,33
31,33
31,33
31,34
38,40
39,41
39,42
40,42
40,42
40,42
41,43
41,43
41,43
41,44
42,44
42,44
42,44
42,45
43,45
43,45
43,45
44,46
44,46
44,46
44,47
45,47
45,47
45,47
45,48
46,48
46,48
46,48
47,49
47,49
47,49
47,50
48,50
48,50
48,50
48,51
49,51
49,51
49,51
50,52
50,52
50,52
50,52
51,53
51,53
51,53
51,54
52,54
52,54
52,54
53,55
53,55
53,55
53,55
54,56
54,56
54,56
54,57
55,57
55,57
55,57
56,58
56,58
56,58
56,58
57,59
57,59
57,59
58,60
58,60
58,60
58,61
59,61
59,61
59,61
59,62
60,62
60,62
60,62
61,63
61,63
61,63
61,64
62,64
62,64
62,64
62,65
63,65
63,65
63,65
64,66
64,66
64,66
64,66
65,67
65,67
65,67
65,68
66,68
66,68
66,68
67,69
67,69
67,69
67,69
68,70
68,70
68,70
68,71
69,71
69,71
69,71
70,72
70,72
70,72
70,72
71,73
71,73
71,73
71,73
72,74
72,74
72,74
72,75
73,75
73,75
73,75
74,76
74,76
74,76
74,76
75,77
75,77
75,77
75,78
76,78
76,78
76,78
77,79
77,79
77,79
77,22
78,23
78,23
78,23
78,23
79,24
22,24
23,24
23,24
23,25
23,25
24,25
24,26
24,26
25,26
25,27
25,27
26,27
26,27
26,28
27,28
27,28
27,29
27,29
28,29
28,29
28,30
28,30
29,30
29,31
29,31
29,31
30,31
30,32
30,32
31,32
31,32
31,33
31,33
32,33
32,34
32,34
33,34
33,34
33,35
33,35
34,35
34,35
34,36
34,36
35,36
35,37
35,37
36,37
36,37
36,38
36,38
37,38
37,38
37,39
37,39
38,39
38,40
38,40
39,40
39,40
39,41
39,41
40,41
40,41
40,42
40,42
41,42
41,43
41,43
42,43
42,43
42,44
42,44
43,44
43,44
43,45
43,45
44,45
44,46
44,46
45,46
45,46
45,47
45,47
46,47
46,48
46,48
46,48
47,48
47,49
47,49
48,49
48,49
48,50
48,50
49,50
49,50
49,51
49,51
50,51
50,52
50,52
51,52
51,52
51,53
51,53
52,53
52,53
52,54
52,54
53,54
53,54
53,55
54,55
54,55
54,56
54,56
55,56
55,56
55,57
55,57
56,57
56,57
56,58
57,58
57,58
57,59
57,59
58,59
58,59
58,60
58,60
59,60
59,60
59,61
60,61
60,61
60,61
60,62
61,62
61,62
61,63
61,63
62,63
62,63
62,64
62,64
63,64
63,65
63,65
64,65
64,65
64,66
64,66
65,66
65,66
65,67
65,67
66,67
66,67
66,68
67,68
67,68
67,69
68,69
68,69
68,70
68,70
69,70
69,71
69,71
70,71
70,71
70,72
70,72
71,72
71,72
71,73
72,73
72,73
72,74
72,74
73,74
73,74
73,75
73,75
74,75
74,75
74,76
75,76
75,76
75,76
75,77
76,77
76,77
76,78
76,78
77,78
77,78
77,79
78,79
78,79
78,22
78,23
79,23
79,23
22,23
23,24
23,24
23,24
24,25
24,25
24,25
24,25
25,26
25,26
25,26
25,26
26,27
26,27
26,27
27,27
27,28
27,28
27,28
28,29
28,29
28,29
28,29
29,30
29,30
29,30
29,30
30,31
30,31
30,31
31,32
31,32
31,32
31,32
32,33
32,33
32,33
32,33
33,34
33,34
33,34
34,34
34,35
34,35
34,35
35,36
35,36
35,36
35,36
36,37
36,37
36,37
37,37
37,38
37,38
37,38
38,39
38,39
38,39
38,39
39,40
39,40
39,40
40,40
40,41
40,41
40,41
41,42
41,42
41,42
41,42
42,43
42,43
42,43
43,43
43,44
43,44
43,44
44,45
44,45
44,45
44,45
45,46
45,46
45,46
45,46
46,47
46,47
46,47
47,47
47,48
47,48
47,48
48,49
48,49
48,49
48,49
49,50
49,50
49,50
50,50
50,51
50,51
50,51
51,51
51,52
51,52
51,52
52,53
52,53
52,53
53,53
53,54
53,54
53,54
54,54
54,55
54,55
55,56
55,56
55,56
56,56
56,57
56,57
56,57
57,58
57,58
57,58
58,58
58,59
58,59
58,59
59,59
59,60
59,60
59,60
60,60
60,61
60,61
60,61
61,62
61,62
61,62
62,62
62,63
62,63
62,63
63,64
63,64
63,64
63,64
64,65
64,65
64,65
65,65
65,66
65,66
65,66
66,67
66,67
66,67
67,67
67,68
67,68
67,68
68,69
68,69
68,69
69,69
69,70
69,70
69,70
70,71
70,71
70,71
70,71
71,72
71,72
71,72
72,72
72,73
72,73
72,73
73,74
73,74
73,74
73,74
74,75
74,75
74,75
75,75
75,76
75,76
75,76
76,76
76,77
76,77
76,77
77,78
77,78
77,78
78,78
78,79
78,79
78,79
22,23
23,24
23,24
23,24
24,25
24,25
25,25
25,26
26,27
26,27
27,27
27,28
27,28
28,29
28,29
28,29
29,29
29,30
29,30
29,30
30,31
30,31
30,31
30,31
31,32
31,32
31,32
31,32
32,33
32,33
32,33
94,24

The data stops here because both nodes crashed.

There is nothing (and no crash) in the logs of the three non-voting_only master nodes except the generic

transport connection to [{COLDx}{id removed}{id removed}{ip removed}{ip removed:9300}{cmv}] closed by remote

and one complaint on the active master node:

[master1] health check of [/var/lib/elasticsearch/nodes/0] took [17210ms] which is above the warn threshold of [5s]

That was a very unfortunate failure during a VM migration, but it is no reason for the voting_only nodes to fill their heap and crash.

What is the topology of your cluster? Why would you have more than one voting-only master node?

Sorry, I never described this before. What do you mean by topology?

3 dedicated master nodes plus 2 voting_only nodes shared with the cold tier gives a quorum of 3 and allows losing two master-eligible nodes. The 3 master nodes and 1 voting_only node you suggest also give a quorum of 3, but allow losing only one node.
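Roughly, the arithmetic I am assuming here (treating every master-eligible node as part of the voting configuration):

quorum = floor(n / 2) + 1

With n = 5 master-eligible nodes the quorum is 3, so two of them can fail; with n = 4 the quorum is still 3, so only one can fail.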

If you have 3 dedicated master nodes I do not see the point in having ANY voting-only master nodes, so I would recommend removing them.

Then I lose redundancy and cannot keep working with only one surviving master node. Why do you suggest this?

Update: I tried removing the roles anyway and restarted, but it did not help: the nodes crashed again at exactly the same time.

Maybe because this is a direct cut and paste from the official documentation:

it is good practice to limit the number of master-eligible nodes in the cluster to three.

?

IIRC in a previous thread you had various dc/rack awareness settings, not mentioned here. Anyway, should that be the scenario here, then this section might also apply:

You can solve this by placing one master-eligible node in each of your two zones and adding a single extra master-eligible node in an independent third zone. The extra master-eligible node acts as a tiebreaker in cases where the two original zones are disconnected from each other. The extra tiebreaker node should be a dedicated voting-only master-eligible node, also known as a dedicated tiebreaker.

To my understanding, you are not aligned with this recommendation either? Please correct me if that’s inaccurate.

But:

What’s the current output of GET /_cat/nodes?v?

It seems the current crashes may be unrelated to the master node count. So you probably need to understand, urgently, what specifically is causing all the heap usage on the cold nodes.
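If you can catch it before the heap fills completely, the per-node circuit breaker stats might at least hint at which category of memory is growing (parent, request, fielddata, in-flight requests, …); the node names here are just the ones from your logs:

GET _nodes/COLD1,COLD2/stats/breaker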

How many nodes do you have in total, and what are their roles and specs? Your description is a little confusing: you mention having 3 dedicated master nodes and 2 voting-only nodes that also have the data_cold role, but you also need data_hot/data_content/ingest nodes somewhere.

If those roles are on the dedicated master nodes, then they are not dedicated master nodes, they are data-and-master nodes.

Share the response of GET /_cat/nodes?v from Kibana Dev Tools.

Sorry, and thank you for spotting my mistakes.
Yes, the dedicated masters are not really dedicated because they store the Kibana and security indices.
Yes, hot/warm nodes exist so the tiers work. I forgot about them because they kept working without problems.

Thank you both for GET /_cat/nodes?v. Very useful for checking roles!

heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
          27          98  13    6.75    8.15     8.15 hw        -      DATA1
          49          98  13    6.90    8.00     7.80 hw        -      DATA2
          42          99   0    0.03    0.08     0.15 ms        -      master1
          63          95   1    0.04    0.09     0.05 ms        *      master2
          63          98   1    0.13    0.06     0.03 ms        -      master3
          15          72   2    0.59    0.84     1.09 c         -      COLD1
           9          87   2    0.17    0.17     0.13 c         -      COLD2

Yes, I agree a warm tier like this is a bit useless.

I agree it goes against the recommendations for best performance. But is trading performance for more redundancy wrong? Slower is okay here, but less redundancy is a problem.

Update: Yes, I agree the redundancy setup of this cluster is super dumb, but that is no reason for the heap to suddenly fill up and crash both cold nodes.

You forgot the specs.

What are the specs of each node? How many CPUs? How much memory? What heap size is configured? What is the disk type, HDD or SSD?

Also, your hot nodes are also warm nodes; this does not make much sense, as the goal of data tiering is to have different hardware for different tiers. Having the hot and warm roles on the same node does not help with anything.

I'm not sure and may be wrong, but I think this could also lead to unnecessary shard movement if you are using ILM and moving data from hot to warm.

You already have 3 master-eligible nodes, which is enough redundancy for a small cluster; I've run one with 25 nodes and just 3 master-eligible nodes without any issues.

I don't think having 2 voting-only nodes helps with redundancy; it depends on a lot of things.

Also, your issue may not even be related to this, since you said it keeps happening after changing the nodes to data_cold only.

You would need to provide more context, like the specs of your nodes and the amount of data you have.

Run this in Kibana Dev Tools and share the result; it will show the disk usage of your nodes:

GET /_cat/nodes?v&h=name,role,disk.used_percent,disk.used,disk.avail&s=role

Do you have any ILM policies configured?
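For example (the index pattern below is only a placeholder), these would list the policies and show where your indices currently sit in their lifecycle:

GET _ilm/policy
GET my-index-*/_ilm/explain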

GET _nodes/stats/jvm

Might help, but it will likely just tell you the same story: something is using a lot of heap. Do you have a heap dump from a previous crash? If so, analyse that.

Analysing JVM heap usage on a running system isn’t trivial and has its dangers, but there are tools: jmap, VisualVM…

Good ideas.
Looking at /_nodes/stats/jvm is how I know the cold nodes are very idle and GC happens roughly every three hours. I monitor heap_used_percent with a script, so I can see this crystal clear. Maybe the cold nodes simply have too much RAM? Judging from this behaviour, the heap suddenly fills and GC does not bring memory down. Do you want to graph the values yourself, or do you prefer a picture?
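Something along these lines is enough to collect those per-minute heap_used_percent values (filter_path only trims the response to the interesting field; the node names are the ones above):

GET _nodes/COLD1,COLD2/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent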
Update: Put the data as CSV into Heap suddenly full on all voting-only master nodes (7.17.28).

Yes, systemd said Heap dump file created. Looks like a good clue, but I know nothing about heap dumps. Where do I start?

I prefer pictures.

Find the heap dump! Expect it to be a large file.
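If it is not obvious where the dump landed, the JVM arguments each node started with (including any -XX:HeapDumpPath=... override; the packaged defaults typically point at the data directory) are visible via the node info API:

GET _nodes/COLD1,COLD2/jvm?filter_path=**.input_arguments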


[heap_used_percent graph for both cold nodes]
Very boring, except that both nodes crash right after this.

Easy, systemd gives a good hint: Dumping heap to /var/lib/elasticsearch/java_pid398746.hprof, and yes, that's big! Good thing you mentioned it, because there is one file like this for each restart.

The Eclipse Memory Analyzer is a fast and feature-rich Java heap analyzer that helps you find memory leaks and reduce memory consumption.

The JDK bundled with Elasticsearch probably includes jmap. There’s VisualVM. You have various alternatives. Good luck.

You started out so easy. I didn't expect the difficulty to climb this high this fast.

Wildly guessing, your data_cold nodes might be getting searches that are expensive for them. You didn’t provide any specs.

Also a good idea. I can see it being somebody running dumb searches in Kibana. But you can't know who searches what and when without a subscription, right?

Meanwhile, I discovered that somebody had enabled swap on these nodes. Fixed.
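A quick way to confirm the heap is locked in memory on every node (assuming bootstrap.memory_lock is what I should rely on instead of swap) seems to be:

GET _nodes?filter_path=**.mlockall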

lol. That happens quite often on here

For finding dumb queries you can look at the slow log?
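Something like this on the cold indices would turn it on; the index name and the thresholds are only examples:

PUT /my-cold-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}

Queries above the thresholds then show up in the index search slow log on whichever node executed them.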