If you abort this kind of stuck snapshot (by deleting it), does it eventually stop properly?
What are those 89 failed shards? Why did they fail? (can you share logs or the concrete failures?)
What should we do? Is it possible to rerun the snapshot?
Aborting the snapshot and running it again seems like the best option here if the snapshot isn't making any progress at all. If it is making some progress, letting it finish and then running another snapshot will be faster due to the incremental nature of snapshots. Even if one snapshot has some failures, the data it put in the repository will be reused by the next snapshot where possible, so even a partially failed snapshot contributes progress to future snapshots.
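For reference, aborting a stuck snapshot and starting a fresh one can be done through the snapshot APIs. A minimal sketch, assuming a repository named `my_repo` and snapshot names `snap_1`/`snap_2` (all placeholders) and a cluster on `localhost:9200`:

```shell
# Abort the stuck snapshot by deleting it; if it is still
# running, the delete also cancels it.
curl -X DELETE "localhost:9200/_snapshot/my_repo/snap_1"

# Start a new snapshot; data already written to the repository
# by the earlier (partial) snapshot is reused incrementally.
curl -X PUT "localhost:9200/_snapshot/my_repo/snap_2?wait_for_completion=false"

# Check progress, including per-shard state and any failures.
curl -X GET "localhost:9200/_snapshot/my_repo/snap_2/_status"
```

The `_status` output is also where the per-shard failure details (like the 89 failed shards mentioned above) show up.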
Likely this would have sufficed to fix the issue, as the snapshot implementation is designed to be resilient to errors. As a tip for the next time you run into trouble, I'd try this first before moving on to more time-consuming workarounds.