"File corruption occurred on recovery but checksums are ok"

CWS-Dan · March 27, 2024, 9:40pm

I'm running Elasticsearch 7.14.0. I have a cluster with 3 nodes: CWS-CWHELP, CWS-CWHELP2, and CWS-CWHELP3. CWS-CWHELP and CWS-CWHELP3 are fine, but CWS-CWHELP2 is having problems I cannot explain. When it runs and the cluster tries to allocate shards to it, I get the following error, which recurs 2-3 times per second:

[2024-03-27T14:28:54,266][WARN ][o.e.i.c.IndicesClusterStateService] [CWS-CWHELP2] [onlinehelp][4] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [onlinehelp][4]: Recovery failed from {CWS-CWHELP}{md51sqxmRBq6DPOQ2oI2Gw}{GCYUQIFmQ3iO38CdZna8JQ}{CWS-CWHELP}{10.50.2.27:4358}{cdfhilmrstw}{ml.machine_memory=4294361088, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true} into {CWS-CWHELP2}{FAncNZxMQ3C9lcNWhWd3hA}{lUKD01ALRY6hyyp2rz6dEw}{CWS-CWHELP2}{10.50.2.5:4358}{cdfhilmrstw}{ml.machine_memory=4294361088, xpack.installed=true, transform.node=true, ml.max_open_jobs=512, ml.max_jvm_size=1073741824}
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:638) [elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1288) [elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:313) [elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:673) [elasticsearch-7.14.0.jar:7.14.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
	at java.lang.Thread.run(Thread.java:831) [?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [CWS-CWHELP][10.50.2.27:4358][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.transport.RemoteTransportException: [File corruption occurred on recovery but checksums are ok]
	Suppressed: org.elasticsearch.transport.RemoteTransportException: [CWS-CWHELP2][10.50.2.5:4358][internal:index/shard/recovery/file_chunk]
	Caused by: org.elasticsearch.common.util.concurrent.UncategorizedExecutionException: Failed execution
		at org.elasticsearch.common.util.concurrent.FutureUtils.rethrowExecutionException(FutureUtils.java:80) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:72) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListenerDirectly(ListenableFuture.java:112) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:100) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:151) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:147) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.indices.recovery.RecoveryTarget.writeFileChunk(RecoveryTarget.java:497) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:467) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:437) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:61) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:212) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.14.0.jar:7.14.0]
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
		at java.lang.Thread.run(Thread.java:831) [?:?]
	Caused by: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: execution_exception: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=15hvejs actual=1489bk (resource=name [_1sc_Lucene84_0.doc], length [154680], checksum [15hvejs], writtenBy [8.9.0]) (resource=VerifyingIndexOutput(_1sc_Lucene84_0.doc))
		at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:262) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:235) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:54) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:65) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListenerDirectly(ListenableFuture.java:112) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:100) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:151) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:147) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.indices.recovery.RecoveryTarget.writeFileChunk(RecoveryTarget.java:497) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:467) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:437) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:61) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:212) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.14.0.jar:7.14.0]
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
		at java.lang.Thread.run(Thread.java:831) ~[?:?]
	Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=15hvejs actual=1489bk (resource=name [_1sc_Lucene84_0.doc], length [154680], checksum [15hvejs], writtenBy [8.9.0]) (resource=VerifyingIndexOutput(_1sc_Lucene84_0.doc))
		at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.readAndCompareChecksum(Store.java:1204) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.writeByte(Store.java:1182) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.index.store.Store$LuceneVerifyingIndexOutput.writeBytes(Store.java:1212) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.indices.recovery.MultiFileWriter.innerWriteFileChunk(MultiFileWriter.java:122) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.indices.recovery.MultiFileWriter.access$000(MultiFileWriter.java:37) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.indices.recovery.MultiFileWriter$FileChunkWriter.writeChunk(MultiFileWriter.java:216) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.indices.recovery.MultiFileWriter.writeFileChunk(MultiFileWriter.java:67) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.indices.recovery.RecoveryTarget.writeFileChunk(RecoveryTarget.java:494) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:467) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$FileChunkTransportRequestHandler.messageReceived(PeerRecoveryTargetService.java:437) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:61) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:212) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732) ~[elasticsearch-7.14.0.jar:7.14.0]
		at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.14.0.jar:7.14.0]
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
		at java.lang.Thread.run(Thread.java:831) ~[?:?]

I'm guessing the 4 in "[CWS-CWHELP2] [onlinehelp][4]" refers to the shard number. Curiously, every time I see this error, that number is 2, 3, or 4. Shards 0 and 1, for whatever reason, never appear.

I tried deleting the data directory and restarting the node. I tried a fresh install of Elasticsearch. Still, these errors persist. I've spent days googling these errors and trying various fixes, but I just can't figure it out. Any ideas?

DavidTurner · March 28, 2024, 10:52am

It means the data that CWS-CWHELP2 is receiving isn't the data that ES originally wrote. CWS-CWHELP is sending the correct data, which suggests something on your network is meddling with the data in transit. In any case it's a problem outside of Elasticsearch; see these docs for more info:

Also 7.14.0 is really old, long past EOL and wholly unsupported these days. You need to upgrade as a matter of urgency.
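
To make the "data changed in transit" point concrete, here is a minimal sketch. It assumes the checksum in the log is a CRC32 printed in base 36, which is how values like `expected=15hvejs actual=1489bk` appear to be formatted; treat that as an assumption rather than a statement about Elasticsearch internals. The helper names (`to_base36`, `checksum`, `flip_one_bit`) are hypothetical. The sender and receiver each compute a checksum over the same chunk, and a single flipped bit on the wire is enough to make them disagree.

```python
import zlib

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(n: int) -> str:
    """Render a non-negative integer in base 36 (digits 0-9, then a-z)."""
    if n == 0:
        return "0"
    out = []
    while n:
        n, remainder = divmod(n, 36)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


def checksum(data: bytes) -> str:
    """CRC32 of the data, formatted in base 36 (assumed log format)."""
    return to_base36(zlib.crc32(data) & 0xFFFFFFFF)


def flip_one_bit(data: bytes, index: int) -> bytes:
    """Return a copy of the data with the lowest bit of one byte flipped."""
    corrupted = bytearray(data)
    corrupted[index] ^= 0x01
    return bytes(corrupted)


sent = bytes(range(256)) * 4        # stand-in for a file chunk on the source node
received = flip_one_bit(sent, 100)  # simulate damage somewhere on the network

print("expected =", checksum(sent))
print("actual   =", checksum(received))
print("match    =", checksum(sent) == checksum(received))
```

The same idea works as a manual check outside Elasticsearch: copy a large file between the two hosts and compare a hash computed on each side.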

CWS-Dan · March 28, 2024, 2:53pm

Thank you for your help David, I will read the link. One more piece of info in the logs that I want to ask about, as it might help in identifying what data is being meddled with. I consistently see errors in the logs to this effect.

Invalid string; unexpected character: 145 hex: 91

Below is from CWS-CWHELP:

[2024-03-28T10:41:02,047][WARN ][o.e.c.r.a.AllocationService] [CWS-CWHELP] failing shard [failed shard, shard [onlinehelp][2], node[lewqBf60TxGn4fknUmmJNg], relocating [md51sqxmRBq6DPOQ2oI2Gw], [P], recovery_source[peer recovery], s[INITIALIZING], a[id=4veLLKB0Ty6cIvXGVq0adA, rId=t57nYJZfS2StPTYtkez4hQ], message [failed recovery], failure [RecoveryFailedException[[onlinehelp][2]: Recovery failed from {CWS-CWHELP}{md51sqxmRBq6DPOQ2oI2Gw}{GCYUQIFmQ3iO38CdZna8JQ}{CWS-CWHELP}{10.50.2.27:4358}{cdfhilmrstw}{ml.machine_memory=4294361088, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true} into {CWS-CWHELP2}{lewqBf60TxGn4fknUmmJNg}{2Lw5aUkuTxqrp0vkHMt_xw}{CWS-CWHELP2}{10.50.2.5:4358}{cdfhilmrstw}{ml.machine_memory=4294361088, xpack.installed=true, transform.node=true, ml.max_open_jobs=512, ml.max_jvm_size=1073741824}]; nested: RemoteTransportException[[CWS-CWHELP][10.50.2.27:4358][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[CWS-CWHELP2][10.50.2.5:4358][internal:index/shard/recovery/file_chunk]]; nested: IOException[Invalid string; unexpected character: 145 hex: 91]; ], markAsStale [true]]
org.elasticsearch.indices.recovery.RecoveryFailedException: [onlinehelp][2]: Recovery failed from {CWS-CWHELP}{md51sqxmRBq6DPOQ2oI2Gw}{GCYUQIFmQ3iO38CdZna8JQ}{CWS-CWHELP}{10.50.2.27:4358}{cdfhilmrstw}{ml.machine_memory=4294361088, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true} into {CWS-CWHELP2}{lewqBf60TxGn4fknUmmJNg}{2Lw5aUkuTxqrp0vkHMt_xw}{CWS-CWHELP2}{10.50.2.5:4358}{cdfhilmrstw}{ml.machine_memory=4294361088, xpack.installed=true, transform.node=true, ml.max_open_jobs=512, ml.max_jvm_size=1073741824}
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:638) ~[elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1288) ~[elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:313) ~[elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:673) ~[elasticsearch-7.14.0.jar:7.14.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
	at java.lang.Thread.run(Thread.java:831) [?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [CWS-CWHELP][10.50.2.27:4358][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.transport.RemoteTransportException: [CWS-CWHELP2][10.50.2.5:4358][internal:index/shard/recovery/file_chunk]
Caused by: java.io.IOException: Invalid string; unexpected character: 145 hex: 91
	at org.elasticsearch.common.io.stream.StreamInput.throwOnBrokenChar(StreamInput.java:550) ~[elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.common.io.stream.StreamInput.readString(StreamInput.java:509) ~[elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.indices.recovery.RecoveryFileChunkRequest.<init>(RecoveryFileChunkRequest.java:35) ~[elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.transport.RequestHandlerRegistry.newRequest(RequestHandlerRegistry.java:48) ~[elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:188) ~[elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:100) ~[elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:82) ~[elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:710) ~[elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:129) ~[elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:104) ~[elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:69) ~[elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:63) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:271) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[?:?]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[?:?]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
	... 1 more

At almost the exact same time, I see this in CWS-CWHELP2:

[2024-03-28T10:41:02,045][WARN ][o.e.i.c.IndicesClusterStateService] [CWS-CWHELP2] [onlinehelp][2] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [onlinehelp][2]: Recovery failed from {CWS-CWHELP}{md51sqxmRBq6DPOQ2oI2Gw}{GCYUQIFmQ3iO38CdZna8JQ}{CWS-CWHELP}{10.50.2.27:4358}{cdfhilmrstw}{ml.machine_memory=4294361088, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true} into {CWS-CWHELP2}{lewqBf60TxGn4fknUmmJNg}{2Lw5aUkuTxqrp0vkHMt_xw}{CWS-CWHELP2}{10.50.2.5:4358}{cdfhilmrstw}{ml.machine_memory=4294361088, xpack.installed=true, transform.node=true, ml.max_open_jobs=512, ml.max_jvm_size=1073741824}
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:638) [elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1288) [elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:313) [elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:673) [elasticsearch-7.14.0.jar:7.14.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
	at java.lang.Thread.run(Thread.java:831) [?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [CWS-CWHELP][10.50.2.27:4358][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.transport.RemoteTransportException: [CWS-CWHELP2][10.50.2.5:4358][internal:index/shard/recovery/file_chunk]
Caused by: java.io.IOException: Invalid string; unexpected character: 145 hex: 91
	at org.elasticsearch.common.io.stream.StreamInput.throwOnBrokenChar(StreamInput.java:550) ~[elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.common.io.stream.StreamInput.readString(StreamInput.java:509) ~[elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.indices.recovery.RecoveryFileChunkRequest.<init>(RecoveryFileChunkRequest.java:35) ~[elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.transport.RequestHandlerRegistry.newRequest(RequestHandlerRegistry.java:48) ~[elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:188) ~[elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:100) ~[elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:82) ~[elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:710) ~[elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:129) ~[elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:104) ~[elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:69) ~[elasticsearch-7.14.0.jar:7.14.0]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:63) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:271) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[?:?]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[?:?]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
	at java.lang.Thread.run(Thread.java:831) ~[?:?]

It's not always that character specifically. I also see characters 254, 152, and 149. Is this a helpful data point, or is it completely arbitrary?

CWS-Dan · March 28, 2024, 4:03pm

Sorry, just one more line in the log that I think might be relevant to the problem.

Caused by: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: execution_exception: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?)

These log excerpts are getting big and ugly, so I'll spare you the rest of the stack trace, unless you think it's important.

DavidTurner · March 28, 2024, 4:28pm

It doesn't help, sorry, but this exception is also an indication that the data sent over the network by CWS-CWHELP is different from the data that CWS-CWHELP2 is receiving. Pretty sure there's something badly wrong with your network. If you can't pin down the problem yourself, you'll need to investigate further with your network or infra folks; there's not a lot more we can do on the Elasticsearch side.
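
On why the specific character values (145, 254, 152, 149) are probably not meaningful, a rough sketch may help. It assumes the reader that throws `Invalid string; unexpected character` decodes strings in a UTF-8-like scheme where the high four bits of the first byte decide how long the character is, and rejects any first byte that cannot start a character. That reading of the string decoder is an assumption, and `classify_lead_byte` is a hypothetical helper, not Elasticsearch code.

```python
def classify_lead_byte(b: int) -> str:
    """Classify a byte as the first byte of an encoded character.

    Assumed scheme: the high nibble selects the length, as in UTF-8
    for characters of up to three bytes. Anything else is rejected.
    """
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    nibble = b >> 4
    if nibble <= 0x7:
        return "1-byte char"
    if nibble in (0xC, 0xD):
        return "2-byte char"
    if nibble == 0xE:
        return "3-byte char"
    return "invalid"


# The values reported in the thread, printed in the same style as the log.
for value in (145, 254, 152, 149):
    print(f"{value} hex: {value:x} -> {classify_lead_byte(value)}")
```

Under that assumption all four reported values are rejected for the same reason, which fits arbitrary bytes turning up where a string was expected rather than one particular character being at fault.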

CWS-Dan · March 28, 2024, 5:48pm

Understood, I will definitely do that. Just one more question, which might be related to Elasticsearch's config, so you might be able to help. For each of our nodes, I have them communicating on ports 4357/4358. Here are the relevant settings in elasticsearch.yml, from CWS-CWHELP2:

http.port: 4357
transport.tcp.port: 4358
discovery.seed_hosts: ["cws-cwhelp:4358","cws-cwhelp3:4358"]

However, when I look in the logs, I can see it attempting to communicate on a port (62538) that isn't in the config:

[2024-03-28T11:41:36,847][WARN ][o.e.t.TcpTransport ] [CWS-CWHELP2] exception caught on transport layer [Netty4TcpChannel{localAddress=/10.50.2.5:62538, remoteAddress=CWS-CWHELP/10.50.2.27:4358, profile=default}], closing connection

Do you know why it would be doing that? Is there another setting I must change? Or do you think this is incidental and not related to Elasticsearch?

DavidTurner · March 28, 2024, 11:26pm

That's normal for TCP. The outbound (local) port of every connection is different, is picked by the operating system, and isn't configurable.
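
A small sketch of the behaviour described above, using only loopback sockets so it does not touch a real cluster. The listening socket stands in for the remote node's transport port (4358 in the config above); here the OS picks a free port instead, which is an assumption made to keep the example self-contained. The client never chooses its own port, yet it ends up with one, the same way 62538 shows up as `localAddress` in the log line.

```python
import socket

# Listening side: stands in for the remote node's transport port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # port 0 asks the OS for any free port
server.listen(1)
server_port = server.getsockname()[1]

# Connecting side: no bind() call, so the OS assigns the local port.
client = socket.create_connection(("127.0.0.1", server_port))
local_port = client.getsockname()[1]
remote_port = client.getpeername()[1]

print("local (OS-assigned) port:", local_port)
print("remote (listening) port: ", remote_port)

client.close()
server.close()
```

Running it several times usually prints a different local port each time, while the remote port is always the one the listener is bound to.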

system · April 25, 2024, 11:27pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
