How can I hunt down this Java error: "a fault occurred in an unsafe memory access operation"

I have a two-node Elasticsearch cluster with both nodes running 8.13.2. Since I added the second node to the cluster, the master node has been crashing regularly. The logs show:

[2024-04-17T15:57:09,258][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [zeek1] fatal error in thread [elasticsearch[zeek1][generic][T#3]], exiting
java.lang.InternalError: a fault occurred in an unsafe memory access operation
        at org.elasticsearch.indices.recovery.MultiChunkTransfer.handleItems(MultiChunkTransfer.java:126) ~[elasticsearch-8.13.2.jar:?]
        at org.elasticsearch.indices.recovery.MultiChunkTransfer$1.write(MultiChunkTransfer.java:72) ~[elasticsearch-8.13.2.jar:?]
        at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.processList(AsyncIOProcessor.java:97) ~[elasticsearch-8.13.2.jar:?]
        at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.drainAndProcessAndRelease(AsyncIOProcessor.java:85) ~[elasticsearch-8.13.2.jar:?]
        at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.put(AsyncIOProcessor.java:76) ~[elasticsearch-8.13.2.jar:?]
        at org.elasticsearch.indices.recovery.MultiChunkTransfer.addItem(MultiChunkTransfer.java:83) ~[elasticsearch-8.13.2.jar:?]
        at org.elasticsearch.indices.recovery.MultiChunkTransfer.lambda$handleItems$4(MultiChunkTransfer.java:120) ~[elasticsearch-8.13.2.jar:?]
        at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:171) ~[elasticsearch-8.13.2.jar:?]
        at org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onResponse(ActionListenerImplementations.java:307) ~[elasticsearch-8.13.2.jar:?]
        at org.elasticsearch.action.ActionListenerImplementations$MappedActionListener.onResponse(ActionListenerImplementations.java:95) ~[elasticsearch-8.13.2.jar:?]
        at org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onResponse(ActionListenerImplementations.java:307) ~[elasticsearch-8.13.2.jar:?]
        at org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onResponse(ActionListenerImplementations.java:307) ~[elasticsearch-8.13.2.jar:?]
        at org.elasticsearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:149) ~[elasticsearch-8.13.2.jar:?]
        at org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onResponse(ActionListenerImplementations.java:307) ~[elasticsearch-8.13.2.jar:?]
        at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:48) ~[elasticsearch-8.13.2.jar:?]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1465) ~[elasticsearch-8.13.2.jar:?]
        at org.elasticsearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:433) ~[elasticsearch-8.13.2.jar:?]
        at org.elasticsearch.transport.InboundHandler$2.doRun(InboundHandler.java:390) ~[elasticsearch-8.13.2.jar:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984) ~[elasticsearch-8.13.2.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.13.2.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.lang.Thread.run(Thread.java:1583) ~[?:?]

In a previous thread where someone else had a similar problem, it was suggested that low disk space (or low memory) could be the culprit. I don't have either of these problems.

In the other thread, someone recommended checking /var/log/kern.log. Here's the contents of that log file:

Apr 17 12:38:37 zeek1 kernel: [7868553.586827] audit: type=1400 audit(1713371917.646:19): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="nvidia_modprobe//kmod" pid=3973329 comm="apparmor_parser"
Apr 17 12:38:37 zeek1 kernel: [7868553.697567] audit: type=1400 audit(1713371917.754:20): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/usr/sbin/chronyd" pid=3973333 comm="apparmor_parser"
Apr 17 12:38:37 zeek1 kernel: [7868553.772613] audit: type=1400 audit(1713371917.830:21): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="tcpdump" pid=3973331 comm="apparmor_parser"
Apr 17 12:38:37 zeek1 kernel: [7868553.811950] audit: type=1400 audit(1713371917.870:22): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=3973330 comm="apparmor_parser"
Apr 17 12:38:37 zeek1 kernel: [7868553.814463] audit: type=1400 audit(1713371917.870:23): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-helper" pid=3973330 comm="apparmor_parser"
Apr 17 12:38:57 zeek1 kernel: [7868573.492336] kauditd_printk_skb: 4 callbacks suppressed
Apr 17 12:38:57 zeek1 kernel: [7868573.492343] audit: type=1400 audit(1713371937.550:28): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=3973762 comm="apparmor_parser"
Apr 17 12:38:57 zeek1 kernel: [7868573.527528] audit: type=1400 audit(1713371937.586:29): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=3973762 comm="apparmor_parser"
Apr 17 12:39:00 zeek1 kernel: [7868576.128680] loop0: detected capacity change from 0 to 8
Apr 17 12:39:00 zeek1 kernel: [7868576.702014] audit: type=1400 audit(1713371940.758:30): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/usr/lib/snapd/snap-confine" pid=3973945 comm="apparmor_parser"
Apr 17 12:39:00 zeek1 kernel: [7868576.704220] audit: type=1400 audit(1713371940.762:31): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=3973945 comm="apparmor_parser"
Apr 17 12:50:26 zeek1 kernel: [7869262.421219] device eno2 entered promiscuous mode
Apr 17 12:51:50 zeek1 kernel: [7869346.437864] device eno2 left promiscuous mode
Apr 17 13:02:56 zeek1 kernel: [7870012.817070] device eno2 entered promiscuous mode
Apr 17 14:24:41 zeek1 kernel: [7874917.609929] device eno2 left promiscuous mode
Apr 17 14:27:06 zeek1 kernel: [7875062.144208] device eno2 entered promiscuous mode
Apr 17 14:33:20 zeek1 kernel: [7875436.202459] device eno2 left promiscuous mode
Apr 17 15:08:14 zeek1 kernel: [7877530.289709] ata2.00: exception Emask 0x0 SAct 0x60000078 SErr 0x0 action 0x0
Apr 17 15:08:14 zeek1 kernel: [7877530.289770] ata2.00: irq_stat 0x40000008
Apr 17 15:08:14 zeek1 kernel: [7877530.289797] ata2.00: failed command: READ FPDMA QUEUED
Apr 17 15:08:14 zeek1 kernel: [7877530.289819] ata2.00: cmd 60/08:f0:48:39:15/00:00:e7:00:00/40 tag 30 ncq dma 4096 in
Apr 17 15:08:14 zeek1 kernel: [7877530.289819]          res 43/40:08:4b:39:15/00:00:e7:00:00/00 Emask 0x409 (media error) <F>
Apr 17 15:08:14 zeek1 kernel: [7877530.289882] ata2.00: status: { DRDY SENSE ERR }
Apr 17 15:08:14 zeek1 kernel: [7877530.289902] ata2.00: error: { UNC }
Apr 17 15:08:14 zeek1 kernel: [7877530.291436] ata2.00: configured for UDMA/133
Apr 17 15:08:14 zeek1 kernel: [7877530.291499] sd 1:0:0:0: [sdb] tag#30 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=2s
Apr 17 15:08:14 zeek1 kernel: [7877530.291508] sd 1:0:0:0: [sdb] tag#30 Sense Key : Medium Error [current]
Apr 17 15:08:14 zeek1 kernel: [7877530.291515] sd 1:0:0:0: [sdb] tag#30 Add. Sense: Unrecovered read error - auto reallocate failed
Apr 17 15:08:14 zeek1 kernel: [7877530.291521] sd 1:0:0:0: [sdb] tag#30 CDB: Read(10) 28 00 e7 15 39 48 00 00 08 00
Apr 17 15:08:14 zeek1 kernel: [7877530.291524] blk_update_request: I/O error, dev sdb, sector 3876927819 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr 17 15:08:14 zeek1 kernel: [7877530.292600] ata2: EH complete
Apr 17 15:08:16 zeek1 kernel: [7877532.463072] ata2.00: exception Emask 0x0 SAct 0x400003 SErr 0x0 action 0x0
Apr 17 15:08:16 zeek1 kernel: [7877532.464098] ata2.00: irq_stat 0x40000008
Apr 17 15:08:16 zeek1 kernel: [7877532.465024] ata2.00: failed command: READ FPDMA QUEUED
Apr 17 15:08:16 zeek1 kernel: [7877532.465938] ata2.00: cmd 60/08:08:48:39:15/00:00:e7:00:00/40 tag 1 ncq dma 4096 in
Apr 17 15:08:16 zeek1 kernel: [7877532.465938]          res 43/40:08:4b:39:15/00:00:e7:00:00/00 Emask 0x409 (media error) <F>
Apr 17 15:08:16 zeek1 kernel: [7877532.467793] ata2.00: status: { DRDY SENSE ERR }
Apr 17 15:08:16 zeek1 kernel: [7877532.468711] ata2.00: error: { UNC }
Apr 17 15:08:16 zeek1 kernel: [7877532.470919] ata2.00: configured for UDMA/133
Apr 17 15:08:16 zeek1 kernel: [7877532.470961] sd 1:0:0:0: [sdb] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=2s
Apr 17 15:08:16 zeek1 kernel: [7877532.470971] sd 1:0:0:0: [sdb] tag#1 Sense Key : Medium Error [current]
Apr 17 15:08:16 zeek1 kernel: [7877532.470978] sd 1:0:0:0: [sdb] tag#1 Add. Sense: Unrecovered read error - auto reallocate failed
Apr 17 15:08:16 zeek1 kernel: [7877532.470985] sd 1:0:0:0: [sdb] tag#1 CDB: Read(10) 28 00 e7 15 39 48 00 00 08 00
Apr 17 15:08:16 zeek1 kernel: [7877532.470987] blk_update_request: I/O error, dev sdb, sector 3876927819 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Apr 17 15:08:16 zeek1 kernel: [7877532.472065] ata2: EH complete

I don't know what JDK version I'm running... I can't seem to find any installed packages called 'java' or 'openjdk' etc.

Since this seems to be a Java problem and not an Elasticsearch problem, what's the best way to track it down?

It's neither, it's faulty storage:

ata2.00: exception Emask 0x0 SAct 0x60000078 SErr 0x0 action 0x0
ata2.00: irq_stat 0x40000008
ata2.00: failed command: READ FPDMA QUEUED
ata2.00: cmd 60/08:f0:48:39:15/00:00:e7:00:00/40 tag 30 ncq dma 4096 in
         res 43/40:08:4b:39:15/00:00:e7:00:00/00 Emask 0x409 (media error) <F>
ata2.00: status: { DRDY SENSE ERR }
ata2.00: error: { UNC }
ata2.00: configured for UDMA/133
sd 1:0:0:0: [sdb] tag#30 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=2s
sd 1:0:0:0: [sdb] tag#30 Sense Key : Medium Error [current]
sd 1:0:0:0: [sdb] tag#30 Add. Sense: Unrecovered read error - auto reallocate failed
sd 1:0:0:0: [sdb] tag#30 CDB: Read(10) 28 00 e7 15 39 48 00 00 08 00
blk_update_request: I/O error, dev sdb, sector 3876927819 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
ata2: EH complete

You'll need to replace the disk.

Aha, thank you.

I verified this by running skdump: the result showed (in part):

Overall Status: BAD_SECTOR

and:

197 current-pending-sector

I don't have a backup of the data. My path.data: is on a different physical disk; if I replace that disk will I need to take any special precautions, or will the indices get rebuilt automatically when I mount the new drive in the same place as the old?

If the data is still there after the fix then Elasticsearch will recover without issue. Hopefully whatever was broken was isolated to the one disk and it hasn't contaminated anything else.

That's the problem. I have RAID 0 on that volume, so the data will be lost.

Ah in that case Elasticsearch will do its best to recover any lost shards from the other nodes in the cluster, although there's a chance that it won't be able to repair everything.

Understood. Thanks very much for your help!