Elasticsearch keeps restarting

Morning,

We have an Elasticsearch cluster consisting of 3 master nodes and 12 data nodes.
One of our data nodes keeps restarting due to the following error in the logs.

[2024-01-29T10:52:57,011][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node12.sc.po] fatal error in thread [elasticsearch[node12.sc.po][generic][T#7]], exiting
java.lang.InternalError: a fault occurred in an unsafe memory access operation
        at org.elasticsearch.indices.recovery.RecoverySourceHandler$3.nextChunkRequest(RecoverySourceHandler.java:1389) ~[elasticsearch-8.10.4.jar:?]
        at org.elasticsearch.indices.recovery.RecoverySourceHandler$3.nextChunkRequest(RecoverySourceHandler.java:1358) ~[elasticsearch-8.10.4.jar:?]
        at org.elasticsearch.indices.recovery.MultiChunkTransfer.getNextRequest(MultiChunkTransfer.java:163) ~[elasticsearch-8.10.4.jar:?]
        at org.elasticsearch.indices.recovery.MultiChunkTransfer.handleItems(MultiChunkTransfer.java:126) ~[elasticsearch-8.10.4.jar:?]
        at org.elasticsearch.indices.recovery.MultiChunkTransfer$1.write(MultiChunkTransfer.java:72) ~[elasticsearch-8.10.4.jar:?]
        at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.processList(AsyncIOProcessor.java:97) ~[elasticsearch-8.10.4.jar:?]
        at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.drainAndProcessAndRelease(AsyncIOProcessor.java:85) ~[elasticsearch-8.10.4.jar:?]
        at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.put(AsyncIOProcessor.java:73) ~[elasticsearch-8.10.4.jar:?]
        at org.elasticsearch.indices.recovery.MultiChunkTransfer.addItem(MultiChunkTransfer.java:83) ~[elasticsearch-8.10.4.jar:?]
        at org.elasticsearch.indices.recovery.MultiChunkTransfer.lambda$handleItems$4(MultiChunkTransfer.java:120) ~[elasticsearch-8.10.4.jar:?]
        at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:177) ~[elasticsearch-8.10.4.jar:?]
        at org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onResponse(ActionListenerImplementations.java:298) ~[elasticsearch-8.10.4.jar:?]
        at org.elasticsearch.action.ActionListenerImplementations$MappedActionListener.onResponse(ActionListenerImplementations.java:95) ~[elasticsearch-8.10.4.jar:?]
        at org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onResponse(ActionListenerImplementations.java:298) ~[elasticsearch-8.10.4.jar:?]
        at org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onResponse(ActionListenerImplementations.java:298) ~[elasticsearch-8.10.4.jar:?]
        at org.elasticsearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:169) ~[elasticsearch-8.10.4.jar:?]
        at org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onResponse(ActionListenerImplementations.java:298) ~[elasticsearch-8.10.4.jar:?]
        at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:49) ~[elasticsearch-8.10.4.jar:?]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1414) ~[elasticsearch-8.10.4.jar:?]
        at org.elasticsearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:398) ~[elasticsearch-8.10.4.jar:?]
        at org.elasticsearch.transport.InboundHandler$2.doRun(InboundHandler.java:355) ~[elasticsearch-8.10.4.jar:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983) ~[elasticsearch-8.10.4.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.10.4.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.lang.Thread.run(Thread.java:1583) ~[?:?]

After doing some research, I suspected it was caused by an HDD failure, so I checked dmesg and noticed that every time the ES error occurs, the following dmesg error appears at the same time:

[Mon Jan 29 10:53:46 2024] sd 0:0:8:0: [sdi] tag#1 Sense Key : Medium Error [current]
[Mon Jan 29 10:53:46 2024] sd 0:0:8:0: [sdi] tag#1 Add. Sense: Read retries exhausted
[Mon Jan 29 10:53:46 2024] sd 0:0:8:0: [sdi] tag#1 CDB: Read(10) 28 00 06 3a 07 60 00 00 08 00
[Mon Jan 29 10:53:46 2024] blk_update_request: critical medium error, dev sdi, sector 104466272

So I ran multiple smartctl tests on /dev/sdi and all of them passed. I'm looking for any kind of help to get ES started on this node.
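For anyone hitting the same thing: a basic SMART health check can pass even while the drive is returning medium errors, so it's worth inspecting the raw SMART attributes and running the extended surface scan. A rough sketch (device name /dev/sdi taken from the dmesg output above; exact attribute names vary by drive vendor):

```shell
# Dump all SMART info and look for signs of media trouble:
# non-zero Reallocated_Sector_Ct, Current_Pending_Sector or
# Offline_Uncorrectable usually mean the disk is dying even if
# the overall health assessment still says PASSED.
smartctl -a /dev/sdi | grep -Ei 'reallocated|pending|uncorrect'

# A short self-test often skips the bad region; run the extended
# (long) surface scan instead, then check the self-test log.
smartctl -t long /dev/sdi
smartctl -l selftest /dev/sdi
```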

In case of a hardware failure, how should I proceed? Since I can't start the node, I can't vote it out of the cluster, delete the "/data" directory, replace the faulty HDD, reinstall ES, and rejoin the server to the cluster.

Also, is there any way for me to get the cluster to exclude/remove the affected node without it being able to start ES?
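In case it helps someone searching later: voting exclusions only apply to master-eligible nodes, but for a data node you can tell the cluster to stop allocating shards to it by name, and that API is served by the healthy nodes, so the broken node does not need to be running. A sketch, assuming the node name from the log above and that any healthy node answers on port 9200:

```shell
# Run this against any healthy node; the failed node does not
# need to be up. It stops the cluster from trying to place
# shards back on node12.sc.po if it flaps up and down.
curl -X PUT "http://localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster.routing.allocation.exclude._name": "node12.sc.po"
  }
}'
```

Remember to clear the setting (set it to null) once the node is rebuilt and you want shards allocated to it again.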

I can confirm that the disk failed on some sectors, and we plan on replacing it.

What steps do I have to take to get the cluster back up and running without issues after we replace it?

My concern is whether ES will start and work after we replace the failing HDD, even though some data will basically be lost along with the old HDD.

Hi there,

If your cluster is yellow, then you do actually have all your data. You might be missing replicas, but your data is intact.
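A quick way to confirm which case you are in (a sketch; run against any healthy node):

```shell
# Overall cluster status plus shard counts.
curl "http://localhost:9200/_cluster/health?pretty"

# List unassigned shards and why they are unassigned; "p" in the
# prirep column means a primary copy is missing (red), "r" means
# only a replica is missing (yellow).
curl "http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" \
  | grep -i unassigned
```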

I am not 100% sure what you plan to do with your disks, but if you are replacing them and then attempt to restart Elasticsearch on that node, it should rejoin the cluster as if it were a new node.

Elasticsearch will then go through a rebalancing process, and eventually the cluster will go green.

I have had a similar issue in the past where, due to data corruption, I had to remove the data directory completely. On restart, the node re-joined as if it were new, and everything was back to normal within a couple of hours.

I hope that helps.

My cluster is in a RED state, but that's okay; I know which data we are going to lose.
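To double-check exactly which indices are affected, something like this should work (a sketch, assuming any healthy node answers on port 9200):

```shell
# Indices whose primaries are lost.
curl "http://localhost:9200/_cat/indices?v&health=red"

# Ask the cluster why a shard is unassigned; with no request body
# it explains the first unassigned shard it finds.
curl "http://localhost:9200/_cluster/allocation/explain?pretty"
```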

After replacing the disks and starting Elasticsearch, the node will join as a new node with a new UUID, assuming I have to clear the /data path. My question is: will the cluster just forget about the old UUID?

Hi there,

A cluster has a single UUID. Just run curl http://<node_name>:9200 to find out. If the configuration is as before, the node will join the cluster as if it were a new node; in any case, the cluster UUID is stored in the data folder.
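For reference, the cluster UUID shows up right in the root endpoint response (hostname is a placeholder):

```shell
curl "http://<node_name>:9200"
# The JSON response includes a "cluster_uuid" field alongside the
# node name, cluster name and version; every node in the same
# cluster reports the same cluster_uuid.
```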

As I mentioned, we have had to do this a couple of times and on each occasion it has been successful.

Elasticsearch is very resilient when it comes to these sorts of things.

If you have a DEV cluster you could always experiment and see what happens.

With regards to your RED cluster, I hope you have backups in place that will help mitigate any loss.
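If you don't have snapshots set up yet, a minimal shared-filesystem repository looks roughly like this (repository name and path are made up for illustration; the path must be listed under path.repo in elasticsearch.yml on every node and be accessible to all of them):

```shell
# Register a shared-filesystem snapshot repository.
curl -X PUT "http://localhost:9200/_snapshot/my_backup" \
  -H 'Content-Type: application/json' -d '
{
  "type": "fs",
  "settings": { "location": "/mnt/es_backups" }
}'

# Take a snapshot of all indices.
curl -X PUT "http://localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true"
```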

Thanks for the help, we will try that once they replace the faulty HDD.

We can't back up the indices that are stored on this node. They come from our IDS, and the cost of backing up that data is just not feasible. We have real-time alerts that run off that data, so I don't mind losing a couple of days' worth of it.

Hi there,

Did you manage to fix the issue?

Yes, I had to replace the failed HDD and re-join the server as a new node.
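For anyone landing here from a search, the sequence that worked boils down to roughly this (paths, service name, and the index name placeholder will differ on your setup):

```shell
# On the broken node: stop ES, replace the disk, wipe the old data.
systemctl stop elasticsearch
# ... physically replace /dev/sdi, recreate the filesystem, remount ...
rm -rf /data/*          # the node's path.data from elasticsearch.yml
systemctl start elasticsearch

# From any healthy node: once the node has rejoined, delete the red
# indices whose only copies lived on the dead disk (the data loss
# was accepted above).
curl -X DELETE "http://localhost:9200/<red_index_name>"
```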

Appreciate the help.