Hi,
We noticed that one of our snapshots has been stuck in the IN_PROGRESS state for a while now:
GET _cat/snapshots/es_backup
es-snapshot-1 SUCCESS 1615184123 06:15:23 1615184133 06:15:33 9.4s 17 17 0 17
es-snapshot-2 SUCCESS 1615184736 06:25:36 1615184737 06:25:37 1s 17 17 0 17
es-snapshot-3 IN_PROGRESS 1615196754 09:45:54 0 00:00:00 4.1h 9 0 0 0
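For completeness, the per-shard progress of the running snapshot can also be queried with the snapshot status APIs (repository and snapshot names as in the table above):

```
GET _snapshot/es_backup/_current
GET _snapshot/es_backup/es-snapshot-3/_status
```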
In the logs I can see "failed to snapshot shard" warnings:
{"type":"log","host":"es-data-1","level":"WARN","systemid":"2106a117733f42d697284fbc54927928","system":"XXX","time": "2021-03-08T09:40:44.024Z","logger":"o.e.t.OutboundHandler","timezone":"UTC","marker":"[es-data-1] ","log":{"message":"send message failed [channel: Netty4TcpChannel{localAddress=/10.10.15.10:9300, remoteAddress=/10.10.15.10:3768}]"}}
java.lang.NullPointerException: null
at org.elasticsearch.common.io.stream.StreamOutput.writeString(StreamOutput.java:435) ~[elasticsearch-7.8.0.jar:7.8.0]
at com.floragunn.searchguard.support.ModuleInfo.writeTo(ModuleInfo.java:99) ~[?:?]
at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:240) [elasticsearch-7.8.0.jar:7.8.0]
at java.lang.Thread.run(Thread.java:834) [?:?]
{"type":"log","host":"es-data-1","level":"WARN","systemid":"2106a117733f42d697284fbc54927928","system":"XXXX","time": "2021-03-08T09:47:29.995Z","logger":"o.e.s.SnapshotShardsService","timezone":"UTC","marker":"[es-data-1] ","log":{"message":"[[fluentdxxx-2021.03.08][0]][es_backup:es-snapshot-2021.03.08-09:45:53/8MwIgSn2Tcu4FJxEQ9bUQg] failed to snapshot shard"}}
...
Caused by: java.io.IOException: Message too long
at java.io.FileDescriptor.close0(Native Method) ~[?:?]
at java.io.FileDescriptor.close(FileDescriptor.java:297) ~[?:?]
at java.io.FileDescriptor$1.close(FileDescriptor.java:88) ~[?:?]
at sun.nio.ch.FileChannelImpl$Closer.run(FileChannelImpl.java:106) ~[?:?]
at jdk.internal.ref.CleanerImpl$PhantomCleanableRef.performCleanup(CleanerImpl.java:186) ~[?:?]
at jdk.internal.ref.PhantomCleanable.clean(PhantomCleanable.java:133) ~[?:?]
at sun.nio.ch.FileChannelImpl.implCloseChannel(FileChannelImpl.java:198) ~[?:?]
... org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.8.0.jar:7.8.0]
... 3 more
{"type":"log","host":"es-data-1","level":"WARN","systemid":"2106a117733f42d697284fbc54927928","system":"XXXX","time": "2021-03-08T09:49:04.202Z","logger":"o.e.s.SnapshotShardsService","timezone":"UTC","marker":"[es-data-1] ","log":{"message":"[[fluentd-xxx-2021.03.08][0]][es_backup:es-snapshot-2021.03.08-09:45:53/8MwIgSn2Tcu4FJxEQ9bUQg] failed to snapshot shard"}}
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
...
org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:76) ~[elasticsearch-7.8.0.jar:7.8.0]
... 20 more
Caused by: java.io.IOException: Message too long
at java.io.FileDescriptor.close0(Native Method) ~[?:?]
at java.io.FileDescriptor.close(FileDescriptor.java:297) ~[?:?]
at java.io.FileDescriptor$1.close(FileDescriptor.java:88) ~[?:?]
at sun.nio.ch.FileChannelImpl$Closer.run(FileChannelImpl.java:106) ~[?:?]
at jdk.internal.ref.CleanerImpl$PhantomCleanableRef.performCleanup(CleanerImpl.java:186) ~[?:?]
at jdk.internal.ref.PhantomCleanable.clean(PhantomCleanable.java:133) ~[?:?]
at sun.nio.ch.FileChannelImpl.implCloseChannel(FileChannelImpl.java:198) ~[?:?]
at java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:112) ~[?:?]
at java.nio.channels.Channels$1.close(Channels.java:177) ~[?:?]
at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:106) ~[elasticsearch-core-7.8.0.jar:7.8.0]
at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:76) ~[elasticsearch-core-7.8.0.jar:7.8.0]
at org.elasticsearch.core.internal.io.Streams.copy(Streams.java:54) ~[elasticsearch-core-7.8.0.jar:7.8.0]
at org.elasticsearch.common.blobstore.fs.FsBlobContainer.writeBlob(FsBlobContainer.java:184) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotFile(BlobStoreRepository.java:2156) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$snapshotShard$58(BlobStoreRepository.java:1827) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.action.ActionRunnable$1.doRun(ActionRunnable.java:45) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:695) ~[elasticsearch-7.8.0.jar:7.8.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.8.0.jar:7.8.0]
... 3 more
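The FsBlobContainer frames in the trace show that the failing repository is a shared-filesystem ("fs") repository. It is registered roughly like this (the location path below is a placeholder, not our real mount point):

```
PUT _snapshot/es_backup
{
  "type": "fs",
  "settings": {
    "location": "/mnt/es_backup"
  }
}
```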
After a long time, the snapshot creation finished with the following errors:
{
  "snapshots" : [
    {
      "snapshot" : "es-snapshot-2021.03.08-09:45:53",
      "uuid" : "8MwIgSn2Tcu4FJxEQ9bUQg",
      "version_id" : 7080099,
      "version" : "7.8.0",
      "indices" : [
        "fluentd-hij-2021.03.08",
        ".kibana_1",
        "fluentd-ghi-2021.03.08",
        "fluentd-fgh-admin-ns-legacy-2021.03.08",
        "fluentd-efg-2021.03.08",
        "fluentd-def-2021.03.08",
        "fluentd-cde-2021.03.08",
        "fluentd-bcd-2021.03.08",
        "fluentd-abc-2021.03.08"
      ],
      "include_global_state" : true,
      "state" : "PARTIAL",
      "start_time" : "2021-03-08T09:45:54.107Z",
      "start_time_in_millis" : 1615196754107,
      "end_time" : "2021-03-08T14:30:19.746Z",
      "end_time_in_millis" : 1615213819746,
      "duration_in_millis" : 17065639,
      "failures" : [
        {
          "index" : "fluentd-def-2021.03.08",
          "index_uuid" : "fluentd-def-2021.03.08",
          "shard_id" : 0,
          "reason" : "UncategorizedExecutionException[Failed execution]; nested: ExecutionException[java.io.IOException: Message too long]; nested: IOException[Message too long]",
          "node_id" : "1cn3bxwYTbqlLjuznkqNKw",
          "status" : "INTERNAL_SERVER_ERROR"
        },
Why is this happening? Can anyone help me resolve this issue?
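I assume I can delete the PARTIAL snapshot and retry with the delete snapshot API (snapshot name taken from the status output above), but I'd like to understand the root cause first:

```
DELETE _snapshot/es_backup/es-snapshot-2021.03.08-09:45:53
```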
Thanks in advance,
Akshat