Morning,
We have an Elasticsearch cluster consisting of 3 master nodes and 12 data nodes.
One of our data nodes keeps restarting due to the following error in the logs:
[2024-01-29T10:52:57,011][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node12.sc.po] fatal error in thread [elasticsearch[node12.sc.po][generic][T#7]], exiting
java.lang.InternalError: a fault occurred in an unsafe memory access operation
at org.elasticsearch.indices.recovery.RecoverySourceHandler$3.nextChunkRequest(RecoverySourceHandler.java:1389) ~[elasticsearch-8.10.4.jar:?]
at org.elasticsearch.indices.recovery.RecoverySourceHandler$3.nextChunkRequest(RecoverySourceHandler.java:1358) ~[elasticsearch-8.10.4.jar:?]
at org.elasticsearch.indices.recovery.MultiChunkTransfer.getNextRequest(MultiChunkTransfer.java:163) ~[elasticsearch-8.10.4.jar:?]
at org.elasticsearch.indices.recovery.MultiChunkTransfer.handleItems(MultiChunkTransfer.java:126) ~[elasticsearch-8.10.4.jar:?]
at org.elasticsearch.indices.recovery.MultiChunkTransfer$1.write(MultiChunkTransfer.java:72) ~[elasticsearch-8.10.4.jar:?]
at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.processList(AsyncIOProcessor.java:97) ~[elasticsearch-8.10.4.jar:?]
at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.drainAndProcessAndRelease(AsyncIOProcessor.java:85) ~[elasticsearch-8.10.4.jar:?]
at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.put(AsyncIOProcessor.java:73) ~[elasticsearch-8.10.4.jar:?]
at org.elasticsearch.indices.recovery.MultiChunkTransfer.addItem(MultiChunkTransfer.java:83) ~[elasticsearch-8.10.4.jar:?]
at org.elasticsearch.indices.recovery.MultiChunkTransfer.lambda$handleItems$4(MultiChunkTransfer.java:120) ~[elasticsearch-8.10.4.jar:?]
at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:177) ~[elasticsearch-8.10.4.jar:?]
at org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onResponse(ActionListenerImplementations.java:298) ~[elasticsearch-8.10.4.jar:?]
at org.elasticsearch.action.ActionListenerImplementations$MappedActionListener.onResponse(ActionListenerImplementations.java:95) ~[elasticsearch-8.10.4.jar:?]
at org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onResponse(ActionListenerImplementations.java:298) ~[elasticsearch-8.10.4.jar:?]
at org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onResponse(ActionListenerImplementations.java:298) ~[elasticsearch-8.10.4.jar:?]
at org.elasticsearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:169) ~[elasticsearch-8.10.4.jar:?]
at org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onResponse(ActionListenerImplementations.java:298) ~[elasticsearch-8.10.4.jar:?]
at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:49) ~[elasticsearch-8.10.4.jar:?]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1414) ~[elasticsearch-8.10.4.jar:?]
at org.elasticsearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:398) ~[elasticsearch-8.10.4.jar:?]
at org.elasticsearch.transport.InboundHandler$2.doRun(InboundHandler.java:355) ~[elasticsearch-8.10.4.jar:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983) ~[elasticsearch-8.10.4.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.10.4.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
at java.lang.Thread.run(Thread.java:1583) ~[?:?]
After doing some research I suspected that it's caused by an HDD failure, so I checked dmesg and noticed that every time the ES error occurs I get the following dmesg error at the same time:
[Mon Jan 29 10:53:46 2024] sd 0:0:8:0: [sdi] tag#1 Sense Key : Medium Error [current]
[Mon Jan 29 10:53:46 2024] sd 0:0:8:0: [sdi] tag#1 Add. Sense: Read retries exhausted
[Mon Jan 29 10:53:46 2024] sd 0:0:8:0: [sdi] tag#1 CDB: Read(10) 28 00 06 3a 07 60 00 00 08 00
[Mon Jan 29 10:53:46 2024] blk_update_request: critical medium error, dev sdi, sector 104466272
So I ran multiple smartctl self-tests on /dev/sdi and all of them passed. I'm looking for any kind of help to get ES started on this node.
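For completeness, beyond the self-tests I'm also planning to look at the raw SMART attributes rather than only the overall verdict. A rough sketch of what I mean (attribute names assume a typical SATA drive and will look different on a SAS disk):

import subprocess

# Hypothetical follow-up check: inspect raw SMART attributes instead of
# relying only on the self-test result, since a drive can report PASSED
# while still having pending/reallocated sectors.
result = subprocess.run(
    ["smartctl", "-A", "/dev/sdi"],
    capture_output=True, text=True, check=False,
)

for line in result.stdout.splitlines():
    # Non-zero values on these attributes usually point to real media
    # problems even when the self-tests pass.
    if any(attr in line for attr in ("Reallocated_Sector_Ct",
                                     "Current_Pending_Sector",
                                     "Offline_Uncorrectable")):
        print(line)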
In case this is a hardware failure, how should I proceed? Since I can't start the node, I can't vote it out of the cluster, then delete the "/data" directory, replace the faulty HDD, reinstall ES, and rejoin the server to the cluster.
Also, is there any way to get my cluster to exclude/remove the affected node without it being able to start ES?
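From the docs, the closest thing I could find is shard allocation filtering, which (if I understand it right) is a cluster-level setting I can apply through any of the healthy nodes, so the broken node never has to start. Is something along these lines the right direction? A rough, untested sketch using the Python requests library; host, credentials, and TLS handling are placeholders for our setup:

import requests

# Sketch only: cluster-wide settings can be changed through any healthy
# node, so the dead data node does not need to be running.
ES_URL = "https://node01.sc.po:9200"   # placeholder: any healthy node
AUTH = ("elastic", "changeme")          # placeholder credentials

# Tell the cluster not to allocate shards to the affected node, so it can
# be wiped, rebuilt, and rejoined later without shards landing on it
# immediately.
resp = requests.put(
    f"{ES_URL}/_cluster/settings",
    json={
        "persistent": {
            "cluster.routing.allocation.exclude._name": "node12.sc.po"
        }
    },
    auth=AUTH,
    verify=False,  # placeholder; we use our own CA in practice
)
print(resp.status_code, resp.json())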