Snapshot got stuck in IN_PROGRESS state

Hi,

We noticed that one of our snapshots has been stuck in the IN_PROGRESS state for a while now:

GET _cat/snapshots/es_backup

es-snapshot-1 SUCCESS     1615184123 06:15:23 1615184133 06:15:33 9.4s 17 17 0 17
es-snapshot-2 SUCCESS     1615184736 06:25:36 1615184737 06:25:37   1s 17 17 0 17
es-snapshot-3 IN_PROGRESS 1615196754 09:45:54 0          00:00:00 4.1h  9  0 0  0
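For reference, the in-progress snapshot can be inspected in more detail with the snapshot status APIs (the repository and snapshot names below are taken from the `_cat` output above; adjust them to your own):

```
GET _snapshot/es_backup/_current

GET _snapshot/es_backup/es-snapshot-3/_status
```

The `_status` response includes per-shard progress, which shows whether the snapshot is still copying data or is hung waiting on a shard.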

In the logs I can see "failed to snapshot shard" warnings:

{"type":"log","host":"es-data-1","level":"WARN","systemid":"2106a117733f42d697284fbc54927928","system":"XXX","time": "2021-03-08T09:40:44.024Z","logger":"o.e.t.OutboundHandler","timezone":"UTC","marker":"[es-data-1] ","log":{"message":"send message failed [channel: Netty4TcpChannel{localAddress=/10.10.15.10:9300, remoteAddress=/10.10.15.10:3768}]"}}
java.lang.NullPointerException: null
	at org.elasticsearch.common.io.stream.StreamOutput.writeString(StreamOutput.java:435) ~[elasticsearch-7.8.0.jar:7.8.0]
	at com.floragunn.searchguard.support.ModuleInfo.writeTo(ModuleInfo.java:99) ~[?:?]
org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:240) [elasticsearch-7.8.0.jar:7.8.0]
	at java.lang.Thread.run(Thread.java:834) [?:?]
{"type":"log","host":"es-data-1","level":"WARN","systemid":"2106a117733f42d697284fbc54927928","system":"XXXX","time": "2021-03-08T09:47:29.995Z","logger":"o.e.s.SnapshotShardsService","timezone":"UTC","marker":"[es-data-1] ","log":{"message":"[[fluentdxxx-2021.03.08][0]][es_backup:es-snapshot-2021.03.08-09:45:53/8MwIgSn2Tcu4FJxEQ9bUQg] failed to snapshot shard"}}
...
Caused by: java.io.IOException: Message too long
	at java.io.FileDescriptor.close0(Native Method) ~[?:?]
	at java.io.FileDescriptor.close(FileDescriptor.java:297) ~[?:?]
	at java.io.FileDescriptor$1.close(FileDescriptor.java:88) ~[?:?]
	at sun.nio.ch.FileChannelImpl$Closer.run(FileChannelImpl.java:106) ~[?:?]
	at jdk.internal.ref.CleanerImpl$PhantomCleanableRef.performCleanup(CleanerImpl.java:186) ~[?:?]
	at jdk.internal.ref.PhantomCleanable.clean(PhantomCleanable.java:133) ~[?:?]
	at sun.nio.ch.FileChannelImpl.implCloseChannel(FileChannelImpl.java:198) ~[?:?]
... org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.8.0.jar:7.8.0]
	... 3 more
{"type":"log","host":"es-data-1","level":"WARN","systemid":"2106a117733f42d697284fbc54927928","system":"XXXX","time": "2021-03-08T09:49:04.202Z","logger":"o.e.s.SnapshotShardsService","timezone":"UTC","marker":"[es-data-1] ","log":{"message":"[[fluentd-xxx-2021.03.08][0]][es_backup:es-snapshot-2021.03.08-09:45:53/8MwIgSn2Tcu4FJxEQ9bUQg] failed to snapshot shard"}}
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
...
org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:76) ~[elasticsearch-7.8.0.jar:7.8.0]
	... 20 more
Caused by: java.io.IOException: Message too long
	at java.io.FileDescriptor.close0(Native Method) ~[?:?]
	at java.io.FileDescriptor.close(FileDescriptor.java:297) ~[?:?]
	at java.io.FileDescriptor$1.close(FileDescriptor.java:88) ~[?:?]
	at sun.nio.ch.FileChannelImpl$Closer.run(FileChannelImpl.java:106) ~[?:?]
	at jdk.internal.ref.CleanerImpl$PhantomCleanableRef.performCleanup(CleanerImpl.java:186) ~[?:?]
	at jdk.internal.ref.PhantomCleanable.clean(PhantomCleanable.java:133) ~[?:?]
	at sun.nio.ch.FileChannelImpl.implCloseChannel(FileChannelImpl.java:198) ~[?:?]
	at java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:112) ~[?:?]
	at java.nio.channels.Channels$1.close(Channels.java:177) ~[?:?]
	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:106) ~[elasticsearch-core-7.8.0.jar:7.8.0]
	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:76) ~[elasticsearch-core-7.8.0.jar:7.8.0]
	at org.elasticsearch.core.internal.io.Streams.copy(Streams.java:54) ~[elasticsearch-core-7.8.0.jar:7.8.0]
	at org.elasticsearch.common.blobstore.fs.FsBlobContainer.writeBlob(FsBlobContainer.java:184) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotFile(BlobStoreRepository.java:2156) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$snapshotShard$58(BlobStoreRepository.java:1827) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.action.ActionRunnable$1.doRun(ActionRunnable.java:45) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:695) ~[elasticsearch-7.8.0.jar:7.8.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.8.0.jar:7.8.0]
	... 3 more

After a long time, snapshot creation finally finished, with the following errors:

{
  "snapshots" : [
    {
      "snapshot" : "es-snapshot-2021.03.08-09:45:53",
      "uuid" : "8MwIgSn2Tcu4FJxEQ9bUQg",
      "version_id" : 7080099,
      "version" : "7.8.0",
      "indices" : [
        "fluentd-hij-2021.03.08",
        ".kibana_1",
        "fluentd-ghi-2021.03.08",
        "fluentd-fgh-admin-ns-legacy-2021.03.08",
        "fluentd-efg-2021.03.08",
        "fluentd-def-2021.03.08",
        "fluentd-cde-2021.03.08",
        "fluentd-bcd-2021.03.08",
        "fluentd-abc-2021.03.08"
      ],
      "include_global_state" : true,
      "state" : "PARTIAL",
      "start_time" : "2021-03-08T09:45:54.107Z",
      "start_time_in_millis" : 1615196754107,
      "end_time" : "2021-03-08T14:30:19.746Z",
      "end_time_in_millis" : 1615213819746,
      "duration_in_millis" : 17065639,
      "failures" : [
        {
          "index" : "fluentd-def-2021.03.08",
          "index_uuid" : "fluentd-def-2021.03.08",
          "shard_id" : 0,
          "reason" : "UncategorizedExecutionException[Failed execution]; nested: ExecutionException[java.io.IOException: Message too long]; nested: IOException[Message too long]",
          "node_id" : "1cn3bxwYTbqlLjuznkqNKw",
          "status" : "INTERNAL_SERVER_ERROR"
        },

Why is this happening? Can anyone help me resolve this issue?

Thanks in advance,
Akshat

You're using a third-party security plugin (Search Guard, per the `com.floragunn.searchguard` frames in the stack trace) which is apparently doing something that results in a NullPointerException. Best to speak to the plugin developer about that.

The "Message too long" error (EMSGSIZE) suggests to me that your shared filesystem is using UDP (e.g. it's NFS mounted with proto=udp) and is trying to send packets larger than your network's MTU. NFS can take a very long time to return errors when misconfigured like that, and Elasticsearch will be waiting for those errors before proceeding, which would explain the multi-hour hang.
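If that's the case, one way to check from the data node is a sketch like the following (assuming a Linux client with the repository on an NFS mount; the host `nfs-server` and mount point `/mnt/es_backup` are hypothetical, substitute your own):

```shell
# Show mount options for all NFS mounts; look for proto=udp and
# rsize/wsize values larger than the path MTU
nfsstat -m

# Test the largest unfragmented packet the path supports
# (ICMP/IP headers add 28 bytes on top of the -s payload)
ping -M do -s 1472 nfs-server

# If the mount uses proto=udp, remounting over TCP typically avoids
# the EMSGSIZE ("Message too long") failures:
#   umount /mnt/es_backup
#   mount -o proto=tcp nfs-server:/export/es_backup /mnt/es_backup
```

Remounting over TCP requires the repository to be idle, so abort or wait out any in-flight snapshot first.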