I'm running Elasticsearch 7.14.0 on a three-node cluster: CWS-CWHELP, CWS-CWHELP2, and CWS-CWHELP3. CWS-CWHELP and CWS-CWHELP3 are fine, but CWS-CWHELP2 is having problems I cannot explain. Whenever it is running and the cluster tries to allocate shards to it, I get the following error, which recurs 2-3 times per second:
[2024-03-27T14:28:54,266][WARN ][o.e.i.c.IndicesClusterStateService] [CWS-CWHELP2] [onlinehelp][4] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [onlinehelp][4]: Recovery failed from {CWS-CWHELP}{md51sqxmRBq6DPOQ2oI2Gw}{GCYUQIFmQ3iO38CdZna8JQ}{CWS-CWHELP}{10.50.2.27:4358}{cdfhilmrstw}{ml.machine_memory=4294361088, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true} into {CWS-CWHELP2}{FAncNZxMQ3C9lcNWhWd3hA}{lUKD01ALRY6hyyp2rz6dEw}{CWS-CWHELP2}{10.50.2.5:4358}{cdfhilmrstw}{ml.machine_memory=4294361088, xpack.installed=true, transform.node=true, ml.max_open_jobs=512, ml.max_jvm_size=1073741824}
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:638) [elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1288) [elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:313) [elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:673) [elasticsearch-7.14.0.jar:7.14.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
at java.lang.Thread.run(Thread.java:831) [?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [CWS-CWHELP][10.50.2.27:4358][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.transport.RemoteTransportException: [File corruption occurred on recovery but checksums are ok]
Suppressed: org.elasticsearch.transport.RemoteTransportException: [CWS-CWHELP2][10.50.2.5:4358][internal:index/shard/recovery/file_chunk]
Caused by: org.elasticsearch.common.util.concurrent.UncategorizedExecutionException: Failed execution
at org.elasticsearch.common.util.concurrent.FutureUtils.rethrowExecutionException(FutureUtils.java:80) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:72) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListenerDirectly(ListenableFuture.java:112) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:100) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:151) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:147) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.indices.recovery.RecoveryTarget.writeFileChunk(RecoveryTarget.java:497) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:467) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:437) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:61) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:212) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.14.0.jar:7.14.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
at java.lang.Thread.run(Thread.java:831) [?:?]
Caused by: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: execution_exception: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=15hvejs actual=1489bk (resource=name [_1sc_Lucene84_0.doc], length [154680], checksum [15hvejs], writtenBy [8.9.0]) (resource=VerifyingIndexOutput(_1sc_Lucene84_0.doc))
at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:262) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:235) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:54) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:65) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListenerDirectly(ListenableFuture.java:112) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:100) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:151) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:147) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.indices.recovery.RecoveryTarget.writeFileChunk(RecoveryTarget.java:497) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:467) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:437) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:61) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:212) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.14.0.jar:7.14.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
at java.lang.Thread.run(Thread.java:831) ~[?:?]
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=15hvejs actual=1489bk (resource=name [_1sc_Lucene84_0.doc], length [154680], checksum [15hvejs], writtenBy [8.9.0]) (resource=VerifyingIndexOutput(_1sc_Lucene84_0.doc))
at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.readAndCompareChecksum(Store.java:1204) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.writeByte(Store.java:1182) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.writeBytes(Store.java:1212) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.indices.recovery.MultiFileWriter.innerWriteFileChunk(MultiFileWriter.java:122) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.indices.recovery.MultiFileWriter.access$000(MultiFileWriter.java:37) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.indices.recovery.MultiFileWriter$FileChunkWriter.writeChunk(MultiFileWriter.java:216) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.indices.recovery.MultiFileWriter.writeFileChunk(MultiFileWriter.java:67) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.indices.recovery.RecoveryTarget.writeFileChunk(RecoveryTarget.java:494) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:467) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:437) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:61) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:212) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732) ~[elasticsearch-7.14.0.jar:7.14.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.14.0.jar:7.14.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
at java.lang.Thread.run(Thread.java:831) ~[?:?]
I'm guessing the 4 in "[CWS-CWHELP2] [onlinehelp][4]" refers to the shard number. Curiously, every time I see this error, that number is 2, 3, or 4. Shards 0 and 1, for whatever reason, never appear.
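To confirm which shard numbers actually show up, a small script along these lines could tally them from the node's log. This is only a sketch: the regular expression is based solely on the WARN line quoted above, the function name `count_failed_shards` is my own, and the log file path is whatever your installation uses, passed as the first argument.

```python
import re
import sys
from collections import Counter

# Matches the WARN line that opens each failure, for example:
# "... [CWS-CWHELP2] [onlinehelp][4] marking and sending shard failed due to [failed recovery]"
# Assumption: every failure starts with a line of this shape.
SHARD_FAILED = re.compile(
    r"\[(?P<node>[^\]]+)\] \[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\] "
    r"marking and sending shard failed"
)


def count_failed_shards(lines):
    """Return a Counter keyed by (index, shard) for every shard-failed line."""
    counts = Counter()
    for line in lines:
        match = SHARD_FAILED.search(line)
        if match:
            key = (match.group("index"), int(match.group("shard")))
            counts[key] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # sys.argv[1] is the path to the Elasticsearch log file (hypothetical).
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for (index, shard), n in sorted(count_failed_shards(log).items()):
            print(f"{index}[{shard}]: {n} failures")
```

Run against the log from CWS-CWHELP2, it prints one line per index and shard with the number of failed recoveries, which makes it easy to see whether shards 0 and 1 are really absent.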
I tried deleting the data directory and restarting the node, and I tried a fresh install of Elasticsearch, but the errors persist. I've spent days googling these errors and trying various fixes, but I just can't figure it out. Any ideas?