Failed Snapshot with Pending Tasks

kjbweb · September 1, 2016, 11:38am

Hi There,

We're running ES version 1.5.2 across 3 data nodes and 3 clients with around 900GB of data.

While migrating to a new (larger) cluster, we attempted to create a snapshot. Partway through, it became apparent that we were going to run out of disk space on the shared mount, so I issued an abort command.

Because of the pending tasks already queued up, the abort/delete command wasn't processed before the mount hit 100% usage, and ES threw the following exception:

[2016-08-31 15:22:24,546][WARN ][snapshots                ] [node name] [[blehblehbleh][0]] [manual_snapshot:snapshot_31-08-2016] failed to create snapshot
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: [phoenix_reporting_20150126_8c3d2e02a][0] Failed to perform snapshot (index files)
        at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$SnapshotContext.snapshot(BlobStoreIndexShardRepository.java:502)
        at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.snapshot(BlobStoreIndexShardRepository.java:140)
        at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.snapshot(IndexShardSnapshotAndRestoreService.java:85)
        at org.elasticsearch.snapshots.SnapshotsService$5.run(SnapshotsService.java:817)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: No space left on device
        at java.io.FileOutputStream.close0(Native Method)
        at java.io.FileOutputStream.close(FileOutputStream.java:393)
        at java.io.FilterOutputStream.close(FilterOutputStream.java:160)
        at org.elasticsearch.common.blobstore.fs.FsBlobContainer$1.close(FsBlobContainer.java:100)
        at java.io.FilterOutputStream.close(FilterOutputStream.java:160)
        at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$SnapshotContext.snapshotFile(BlobStoreIndexShardRepository.java:559)
        at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$SnapshotContext.snapshot(BlobStoreIndexShardRepository.java:500)
        ... 6 more

I've since added more space to the mount, but the cluster still shows the backup in the "STARTED" state, and a whole bunch of pending tasks have built up, such as the following:

151871375 21.8h NORMAL update snapshot state 
151871632 21.8h NORMAL update snapshot state 
151869124 21.9h NORMAL update snapshot state 
151869189 21.9h NORMAL update snapshot state 
151869320 21.9h NORMAL update snapshot state 
151869446 21.9h NORMAL update snapshot state 
151869575 21.9h NORMAL update snapshot state 
151869703 21.9h NORMAL update snapshot state 
151869833 21.9h NORMAL update snapshot state 
151869959 21.9h NORMAL update snapshot state
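
In case it helps anyone looking at a similar backlog, here is a rough sketch of how a listing like the one above could be summarised by task type. It assumes each line has the column layout shown above (insert order, time in queue, priority, then the task source); the function name `summarize_pending_tasks` is made up for illustration.

```python
from collections import Counter


def summarize_pending_tasks(cat_output: str) -> Counter:
    """Count pending tasks by source.

    Assumes each non-empty line looks like:
        <insert order> <time in queue> <priority> <source ...>
    Lines with fewer than four columns are skipped.
    """
    counts = Counter()
    for line in cat_output.splitlines():
        # Split off the first three columns; the rest is the task source,
        # which may itself contain spaces.
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        counts[parts[3].strip()] += 1
    return counts


# Two of the lines from the listing above, used as sample input.
sample = (
    "151871375 21.8h NORMAL update snapshot state \n"
    "151871632 21.8h NORMAL update snapshot state \n"
)

if __name__ == "__main__":
    for source, count in summarize_pending_tasks(sample).most_common():
        print(f"{count:6d}  {source}")
```

Run against the full listing, this would show whether the queue is made up entirely of "update snapshot state" entries or whether other cluster state updates are stuck behind them.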

At this point I can't delete the repository or abort the snapshot, and moving the snapshot files that had already been created out of the mount hasn't helped either.
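
For clarity, this is a minimal sketch of the two snapshot requests involved: a DELETE on the snapshot (which is how a running snapshot is aborted) and a GET on its `_status` endpoint. The repository and snapshot names are taken from the log line above; the base URL `http://localhost:9200` and the helper names are assumptions for illustration. The code only builds the requests and does not send them.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumption: a node reachable on the default HTTP port.
BASE_URL = "http://localhost:9200"


def snapshot_url(repo: str, snapshot: str, suffix: str = "") -> str:
    """Build the URL for a snapshot inside a repository."""
    return f"{BASE_URL}/_snapshot/{quote(repo, safe='')}/{quote(snapshot, safe='')}{suffix}"


def abort_request(repo: str, snapshot: str) -> Request:
    """DELETE on a snapshot; for a running snapshot this requests an abort."""
    return Request(snapshot_url(repo, snapshot), method="DELETE")


def status_request(repo: str, snapshot: str) -> Request:
    """GET on the snapshot's _status endpoint to see its current state."""
    return Request(snapshot_url(repo, snapshot, "/_status"), method="GET")


if __name__ == "__main__":
    for req in (
        abort_request("manual_snapshot", "snapshot_31-08-2016"),
        status_request("manual_snapshot", "snapshot_31-08-2016"),
    ):
        print(req.get_method(), req.full_url)
```

Passing either request to `urllib.request.urlopen` would actually send it; that step is left out here on purpose.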

Is there anything I can do here besides restarting the cluster at this point?

Thanks in advance.