Snapshot got stuck in IN_PROGRESS state

Hi,

We noticed that one of our snapshots has been stuck in the IN_PROGRESS state for a while now:

GET _cat/snapshots/es_backup

es-snapshot-1 SUCCESS     1615184123 06:15:23 1615184133 06:15:33 9.4s 17 17 0 17
es-snapshot-2 SUCCESS     1615184736 06:25:36 1615184737 06:25:37   1s 17 17 0 17
es-snapshot-3 IN_PROGRESS 1615196754 09:45:54 0          00:00:00 4.1h  9  0 0  0
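For reference, the in-progress snapshot can be inspected in more detail with the snapshot status APIs (the repository and snapshot names below are taken from the `_cat` output above; adjust them to your own):

```
GET _snapshot/es_backup/_current

GET _snapshot/es_backup/es-snapshot-3/_status
```

The `_status` response includes per-shard progress, which shows whether the snapshot is still copying data or is hung waiting on a shard.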

In the logs I can see "failed to snapshot shard" warnings:

{"type":"log","host":"es-data-1","level":"WARN","systemid":"2106a117733f42d697284fbc54927928","system":"XXX","time": "2021-03-08T09:40:44.024Z","logger":"o.e.t.OutboundHandler","timezone":"UTC","marker":"[es-data-1] ","log":{"message":"send message failed [channel: Netty4TcpChannel{localAddress=/10.10.15.10:9300, remoteAddress=/10.10.15.10:3768}]"}}
java.lang.NullPointerException: null
	at org.elasticsearch.common.io.stream.StreamOutput.writeString(StreamOutput.java:435) ~[elasticsearch-7.8.0.jar:7.8.0]
	at com.floragunn.searchguard.support.ModuleInfo.writeTo(ModuleInfo.java:99) ~[?:?]
org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:240) [elasticsearch-7.8.0.jar:7.8.0]
	at java.lang.Thread.run(Thread.java:834) [?:?]
{"type":"log","host":"es-data-1","level":"WARN","systemid":"2106a117733f42d697284fbc54927928","system":"XXXX","time": "2021-03-08T09:47:29.995Z","logger":"o.e.s.SnapshotShardsService","timezone":"UTC","marker":"[es-data-1] ","log":{"message":"[[fluentdxxx-2021.03.08][0]][es_backup:es-snapshot-2021.03.08-09:45:53/8MwIgSn2Tcu4FJxEQ9bUQg] failed to snapshot shard"}}
...
Caused by: java.io.IOException: Message too long
	at java.io.FileDescriptor.close0(Native Method) ~[?:?]
	at java.io.FileDescriptor.close(FileDescriptor.java:297) ~[?:?]
	at java.io.FileDescriptor$1.close(FileDescriptor.java:88) ~[?:?]
	at sun.nio.ch.FileChannelImpl$Closer.run(FileChannelImpl.java:106) ~[?:?]
	at jdk.internal.ref.CleanerImpl$PhantomCleanableRef.performCleanup(CleanerImpl.java:186) ~[?:?]
	at jdk.internal.ref.PhantomCleanable.clean(PhantomCleanable.java:133) ~[?:?]
	at sun.nio.ch.FileChannelImpl.implCloseChannel(FileChannelImpl.java:198) ~[?:?]
... org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.8.0.jar:7.8.0]
	... 3 more
{"type":"log","host":"es-data-1","level":"WARN","systemid":"2106a117733f42d697284fbc54927928","system":"XXXX","time": "2021-03-08T09:49:04.202Z","logger":"o.e.s.SnapshotShardsService","timezone":"UTC","marker":"[es-data-1] ","log":{"message":"[[fluentd-xxx-2021.03.08][0]][es_backup:es-snapshot-2021.03.08-09:45:53/8MwIgSn2Tcu4FJxEQ9bUQg] failed to snapshot shard"}}
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
...
org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:76) ~[elasticsearch-7.8.0.jar:7.8.0]
	... 20 more
Caused by: java.io.IOException: Message too long
	at java.io.FileDescriptor.close0(Native Method) ~[?:?]
	at java.io.FileDescriptor.close(FileDescriptor.java:297) ~[?:?]
	at java.io.FileDescriptor$1.close(FileDescriptor.java:88) ~[?:?]
	at sun.nio.ch.FileChannelImpl$Closer.run(FileChannelImpl.java:106) ~[?:?]
	at jdk.internal.ref.CleanerImpl$PhantomCleanableRef.performCleanup(CleanerImpl.java:186) ~[?:?]
	at jdk.internal.ref.PhantomCleanable.clean(PhantomCleanable.java:133) ~[?:?]
	at sun.nio.ch.FileChannelImpl.implCloseChannel(FileChannelImpl.java:198) ~[?:?]
	at java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:112) ~[?:?]
	at java.nio.channels.Channels$1.close(Channels.java:177) ~[?:?]
	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:106) ~[elasticsearch-core-7.8.0.jar:7.8.0]
	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:76) ~[elasticsearch-core-7.8.0.jar:7.8.0]
	at org.elasticsearch.core.internal.io.Streams.copy(Streams.java:54) ~[elasticsearch-core-7.8.0.jar:7.8.0]
	at org.elasticsearch.common.blobstore.fs.FsBlobContainer.writeBlob(FsBlobContainer.java:184) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotFile(BlobStoreRepository.java:2156) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$snapshotShard$58(BlobStoreRepository.java:1827) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.action.ActionRunnable$1.doRun(ActionRunnable.java:45) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:695) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.8.0.jar:7.8.0]
	... 3 more

After a long time, snapshot creation finally finished, with the following errors:

{
  "snapshots" : [
    {
      "snapshot" : "es-snapshot-2021.03.08-09:45:53",
      "uuid" : "8MwIgSn2Tcu4FJxEQ9bUQg",
      "version_id" : 7080099,
      "version" : "7.8.0",
      "indices" : [
        "fluentd-hij-2021.03.08",
        ".kibana_1",
        "fluentd-ghi-2021.03.08",
        "fluentd-fgh-admin-ns-legacy-2021.03.08",
        "fluentd-efg-2021.03.08",
        "fluentd-def-2021.03.08",
        "fluentd-cde-2021.03.08",
        "fluentd-bcd-2021.03.08",
        "fluentd-abc-2021.03.08"
      ],
      "include_global_state" : true,
      "state" : "PARTIAL",
      "start_time" : "2021-03-08T09:45:54.107Z",
      "start_time_in_millis" : 1615196754107,
      "end_time" : "2021-03-08T14:30:19.746Z",
      "end_time_in_millis" : 1615213819746,
      "duration_in_millis" : 17065639,
      "failures" : [
        {
          "index" : "fluentd-def-2021.03.08",
          "index_uuid" : "fluentd-def-2021.03.08",
          "shard_id" : 0,
          "reason" : "UncategorizedExecutionException[Failed execution]; nested: ExecutionException[java.io.IOException: Message too long]; nested: IOException[Message too long]",
          "node_id" : "1cn3bxwYTbqlLjuznkqNKw",
          "status" : "INTERNAL_SERVER_ERROR"
        },

Why is this happening? Can anyone help me resolve this issue?

Thanks in advance,
Akshat

You're using a third-party security plugin (Search Guard, per the `com.floragunn.searchguard` frames in the stack trace) which is apparently doing something that results in a NullPointerException. Best to speak to the plugin developer about that.

The "Message too long" error (EMSGSIZE) suggests to me that your shared filesystem is using UDP (e.g. it's NFS mounted with proto=udp) and is trying to send packets larger than your network's MTU. NFS can take a very long time to return errors when misconfigured like that, and Elasticsearch will be waiting for those errors before proceeding, which would explain the multi-hour hang.
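If that's the case, one way to check from the data node is a sketch like the following (assuming a Linux client with the repository on an NFS mount; the host `nfs-server` and mount point `/mnt/es_backup` are hypothetical, substitute your own):

```shell
# Show mount options for all NFS mounts; look for proto=udp and
# rsize/wsize values larger than the path MTU
nfsstat -m

# Test the largest unfragmented packet the path supports
# (ICMP/IP headers add 28 bytes on top of the -s payload)
ping -M do -s 1472 nfs-server

# If the mount uses proto=udp, remounting over TCP typically avoids
# the EMSGSIZE ("Message too long") failures:
#   umount /mnt/es_backup
#   mount -o proto=tcp nfs-server:/export/es_backup /mnt/es_backup
```

Remounting over TCP requires the repository to be idle, so abort or wait out any in-flight snapshot first.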