I have a two-node Elasticsearch cluster with both nodes running 8.13.2. Since I added the second node to the cluster, the master node has been crashing regularly. The logs show:
[2024-04-17T15:57:09,258][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [zeek1] fatal error in thread [elasticsearch[zeek1][generic][T#3]], exiting
java.lang.InternalError: a fault occurred in an unsafe memory access operation
at org.elasticsearch.indices.recovery.MultiChunkTransfer.handleItems(MultiChunkTransfer.java:126) ~[elasticsearch-8.13.2.jar:?]
at org.elasticsearch.indices.recovery.MultiChunkTransfer$1.write(MultiChunkTransfer.java:72) ~[elasticsearch-8.13.2.jar:?]
at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.processList(AsyncIOProcessor.java:97) ~[elasticsearch-8.13.2.jar:?]
at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.drainAndProcessAndRelease(AsyncIOProcessor.java:85) ~[elasticsearch-8.13.2.jar:?]
at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.put(AsyncIOProcessor.java:76) ~[elasticsearch-8.13.2.jar:?]
at org.elasticsearch.indices.recovery.MultiChunkTransfer.addItem(MultiChunkTransfer.java:83) ~[elasticsearch-8.13.2.jar:?]
at org.elasticsearch.indices.recovery.MultiChunkTransfer.lambda$handleItems$4(MultiChunkTransfer.java:120) ~[elasticsearch-8.13.2.jar:?]
at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:171) ~[elasticsearch-8.13.2.jar:?]
at org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onResponse(ActionListenerImplementations.java:307) ~[elasticsearch-8.13.2.jar:?]
at org.elasticsearch.action.ActionListenerImplementations$MappedActionListener.onResponse(ActionListenerImplementations.java:95) ~[elasticsearch-8.13.2.jar:?]
at org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onResponse(ActionListenerImplementations.java:307) ~[elasticsearch-8.13.2.jar:?]
at org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onResponse(ActionListenerImplementations.java:307) ~[elasticsearch-8.13.2.jar:?]
at org.elasticsearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:149) ~[elasticsearch-8.13.2.jar:?]
at org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onResponse(ActionListenerImplementations.java:307) ~[elasticsearch-8.13.2.jar:?]
at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:48) ~[elasticsearch-8.13.2.jar:?]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1465) ~[elasticsearch-8.13.2.jar:?]
at org.elasticsearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:433) ~[elasticsearch-8.13.2.jar:?]
at org.elasticsearch.transport.InboundHandler$2.doRun(InboundHandler.java:390) ~[elasticsearch-8.13.2.jar:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984) ~[elasticsearch-8.13.2.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.13.2.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
at java.lang.Thread.run(Thread.java:1583) ~[?:?]
In a previous thread where someone else had a similar problem, it was suggested that low disk space (or low memory) could be the culprit. I don't have either of these problems.
In the other thread, someone recommended checking /var/log/kern.log
. Here's the contents of that log file:
Apr 17 12:38:37 zeek1 kernel: [7868553.586827] audit: type=1400 audit(1713371917.646:19): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="nvidia_modprobe//kmod" pid=3973329 comm="apparmor_parser"
Apr 17 12:38:37 zeek1 kernel: [7868553.697567] audit: type=1400 audit(1713371917.754:20): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/usr/sbin/chronyd" pid=3973333 comm="apparmor_parser"
Apr 17 12:38:37 zeek1 kernel: [7868553.772613] audit: type=1400 audit(1713371917.830:21): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="tcpdump" pid=3973331 comm="apparmor_parser"
Apr 17 12:38:37 zeek1 kernel: [7868553.811950] audit: type=1400 audit(1713371917.870:22): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=3973330 comm="apparmor_parser"
Apr 17 12:38:37 zeek1 kernel: [7868553.814463] audit: type=1400 audit(1713371917.870:23): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-helper" pid=3973330 comm="apparmor_parser"
Apr 17 12:38:57 zeek1 kernel: [7868573.492336] kauditd_printk_skb: 4 callbacks suppressed
Apr 17 12:38:57 zeek1 kernel: [7868573.492343] audit: type=1400 audit(1713371937.550:28): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=3973762 comm="apparmor_parser"
Apr 17 12:38:57 zeek1 kernel: [7868573.527528] audit: type=1400 audit(1713371937.586:29): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=3973762 comm="apparmor_parser"
Apr 17 12:39:00 zeek1 kernel: [7868576.128680] loop0: detected capacity change from 0 to 8
Apr 17 12:39:00 zeek1 kernel: [7868576.702014] audit: type=1400 audit(1713371940.758:30): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=3973945 comm="apparmor_parser"
Apr 17 12:39:00 zeek1 kernel: [7868576.704220] audit: type=1400 audit(1713371940.762:31): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=3973945 comm="apparmor_parser"
Apr 17 12:50:26 zeek1 kernel: [7869262.421219] device eno2 entered promiscuous mode
Apr 17 12:51:50 zeek1 kernel: [7869346.437864] device eno2 left promiscuous mode
Apr 17 13:02:56 zeek1 kernel: [7870012.817070] device eno2 entered promiscuous mode
Apr 17 14:24:41 zeek1 kernel: [7874917.609929] device eno2 left promiscuous mode
Apr 17 14:27:06 zeek1 kernel: [7875062.144208] device eno2 entered promiscuous mode
Apr 17 14:33:20 zeek1 kernel: [7875436.202459] device eno2 left promiscuous mode
Apr 17 15:08:14 zeek1 kernel: [7877530.289709] ata2.00: exception Emask 0x0 SAct 0x60000078 SErr 0x0 action 0x0
Apr 17 15:08:14 zeek1 kernel: [7877530.289770] ata2.00: irq_stat 0x40000008
Apr 17 15:08:14 zeek1 kernel: [7877530.289797] ata2.00: failed command: READ FPDMA QUEUED
Apr 17 15:08:14 zeek1 kernel: [7877530.289819] ata2.00: cmd 60/08:f0:48:39:15/00:00:e7:00:00/40 tag 30 ncq dma 4096 in
Apr 17 15:08:14 zeek1 kernel: [7877530.289819] res 43/40:08:4b:39:15/00:00:e7:00:00/00 Emask 0x409 (media error) <F>
Apr 17 15:08:14 zeek1 kernel: [7877530.289882] ata2.00: status: { DRDY SENSE ERR }
Apr 17 15:08:14 zeek1 kernel: [7877530.289902] ata2.00: error: { UNC }
Apr 17 15:08:14 zeek1 kernel: [7877530.291436] ata2.00: configured for UDMA/133
Apr 17 15:08:14 zeek1 kernel: [7877530.291499] sd 1:0:0:0: [sdb] tag#30 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=2s
Apr 17 15:08:14 zeek1 kernel: [7877530.291508] sd 1:0:0:0: [sdb] tag#30 Sense Key : Medium Error [current]
Apr 17 15:08:14 zeek1 kernel: [7877530.291515] sd 1:0:0:0: [sdb] tag#30 Add. Sense: Unrecovered read error - auto reallocate failed
Apr 17 15:08:14 zeek1 kernel: [7877530.291521] sd 1:0:0:0: [sdb] tag#30 CDB: Read(10) 28 00 e7 15 39 48 00 00 08 00
Apr 17 15:08:14 zeek1 kernel: [7877530.291524] blk_update_request: I/O error, dev sdb, sector 3876927819 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr 17 15:08:14 zeek1 kernel: [7877530.292600] ata2: EH complete
Apr 17 15:08:16 zeek1 kernel: [7877532.463072] ata2.00: exception Emask 0x0 SAct 0x400003 SErr 0x0 action 0x0
Apr 17 15:08:16 zeek1 kernel: [7877532.464098] ata2.00: irq_stat 0x40000008
Apr 17 15:08:16 zeek1 kernel: [7877532.465024] ata2.00: failed command: READ FPDMA QUEUED
Apr 17 15:08:16 zeek1 kernel: [7877532.465938] ata2.00: cmd 60/08:08:48:39:15/00:00:e7:00:00/40 tag 1 ncq dma 4096 in
Apr 17 15:08:16 zeek1 kernel: [7877532.465938] res 43/40:08:4b:39:15/00:00:e7:00:00/00 Emask 0x409 (media error) <F>
Apr 17 15:08:16 zeek1 kernel: [7877532.467793] ata2.00: status: { DRDY SENSE ERR }
Apr 17 15:08:16 zeek1 kernel: [7877532.468711] ata2.00: error: { UNC }
Apr 17 15:08:16 zeek1 kernel: [7877532.470919] ata2.00: configured for UDMA/133
Apr 17 15:08:16 zeek1 kernel: [7877532.470961] sd 1:0:0:0: [sdb] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=2s
Apr 17 15:08:16 zeek1 kernel: [7877532.470971] sd 1:0:0:0: [sdb] tag#1 Sense Key : Medium Error [current]
Apr 17 15:08:16 zeek1 kernel: [7877532.470978] sd 1:0:0:0: [sdb] tag#1 Add. Sense: Unrecovered read error - auto reallocate failed
Apr 17 15:08:16 zeek1 kernel: [7877532.470985] sd 1:0:0:0: [sdb] tag#1 CDB: Read(10) 28 00 e7 15 39 48 00 00 08 00
Apr 17 15:08:16 zeek1 kernel: [7877532.470987] blk_update_request: I/O error, dev sdb, sector 3876927819 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr 17 15:08:16 zeek1 kernel: [7877532.472065] ata2: EH complete
I don't know what JDK version I'm running... I can't seem to find any installed packages called 'java' or 'openjdk' etc.
Since this seems to be a Java problem and not an Elasticsearch problem, what's the best way to track it down?