Curator snapshot stuck in INIT

Tim_Curtin1 · May 26, 2019, 11:07am

I've done a fair bit of searching and can't come up with a fix here.

A curator snapshot in our cluster has become stuck (around 7TB of data).

About the cluster

ES 2.4.6
Running on AWS with the cloud plugin

Here's what I have tried so far

delete API (obviously fails)
rolling restart of all nodes
replaced all nodes with new nodes (data and master), no change
checked for failed shards on the data nodes, there are none

I am now at a loss, and I am unable to snapshot the indices... Any help?

theuntergeek · May 26, 2019, 12:07pm

What version of Curator are you using with Elasticsearch 2.4.6?

Regardless, Curator only makes API calls. If there is a stall, it’s inside Elasticsearch. You might have to clear out a stalled snapshot using other API calls.
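
As a starting point for that kind of investigation, here is a minimal Python sketch that asks Elasticsearch which snapshots it currently considers running, via `GET /_snapshot/_status`. The host address is an assumption, and the field names (`repository`, `snapshot`, `state`) are what I would expect in the response, so check them against what your 2.4.6 cluster actually returns.

```python
import json
import urllib.request

# Assumption: the cluster answers here without authentication.
ES_URL = "http://localhost:9200"


def running_snapshots(status_response):
    """Return (repository, snapshot, state) for each entry in a
    _snapshot/_status response body that has already been parsed from JSON."""
    return [
        (s.get("repository"), s.get("snapshot"), s.get("state"))
        for s in status_response.get("snapshots", [])
    ]


def fetch_snapshot_status(base_url=ES_URL):
    """Call GET /_snapshot/_status and return the parsed JSON body."""
    with urllib.request.urlopen(base_url + "/_snapshot/_status") as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage against a live cluster (not executed here):
#   for repo, name, state in running_snapshots(fetch_snapshot_status()):
#       print(repo, name, state)
```

If the list comes back empty while the repository still lists the snapshot as in progress, that mismatch is worth mentioning when asking for help.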

Tim_Curtin1 · May 26, 2019, 8:56pm

We're running Curator 4.3.1. Any thoughts on that? We understood it was not 'safe' to use a later version.

Thanks, we're aware that Curator only makes the API calls.
At this point, we've exhausted all the API calls that are available.

DELETE won't function when a snapshot is IN_PROGRESS
can't delete a repository when it's in progress (snapshotting to S3)
There are no active tasks or pending tasks
(and as mentioned have tried restarts as well as complete replacements)

The only thing left that I can find any information on is a complete shutdown of the cluster, which in an active cluster with 3+ searches per second certainly means downtime.

theuntergeek · May 26, 2019, 10:02pm

DELETE is the way to stop an IN_PROGRESS snapshot. If that isn’t working, I can only surmise that something else is amiss in the cluster. A full or rolling restart is probably your next step. An upgrade to a newer release is also recommended as the 2.x series is no longer supported. Neither is the 5.x series.
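
For reference, the DELETE call in question is `DELETE /_snapshot/<repository>/<snapshot>`. A minimal Python sketch of building that request follows. The host, repository name, and snapshot name are hypothetical placeholders, and the request is only constructed here, not sent.

```python
import urllib.parse
import urllib.request


def build_delete_snapshot_request(base_url, repository, snapshot):
    """Build (but do not send) DELETE /_snapshot/<repository>/<snapshot>."""
    path = "/_snapshot/{}/{}".format(
        urllib.parse.quote(repository, safe=""),
        urllib.parse.quote(snapshot, safe=""),
    )
    return urllib.request.Request(base_url.rstrip("/") + path, method="DELETE")


# Hypothetical names: substitute the real repository and snapshot.
req = build_delete_snapshot_request(
    "http://localhost:9200", "my_s3_repo", "curator-20190526"
)
print(req.get_method(), req.full_url)

# To actually send it against a live cluster:
#   urllib.request.urlopen(req)
```

Building the request separately from sending it makes it easy to confirm the exact URL before anything touches the cluster.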

Tim_Curtin1 · May 26, 2019, 10:48pm

Yes, we had thought so also, but nothing indicates that (_cluster/state shows everything is fine, no ABORTED shards, etc).

_cat/shards also confirms all shards are started
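
For anyone repeating this check, here is a rough Python sketch of how the `_cat/shards` output can be scanned for shards that are not STARTED. It assumes the output was requested with `?h=index,shard,prirep,state` so that the columns arrive in that order. The sample text is made up for illustration and is not from our cluster.

```python
def non_started_shards(cat_shards_text):
    """Parse `_cat/shards?h=index,shard,prirep,state` output and return
    (index, shard, prirep, state) for every shard whose state is not STARTED."""
    problems = []
    for line in cat_shards_text.splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        index, shard, prirep, state = parts[:4]
        if state != "STARTED":
            problems.append((index, int(shard), prirep, state))
    return problems


# Made-up sample output, for illustration only.
SAMPLE = """\
logs-2019.05.25 0 p STARTED
logs-2019.05.25 0 r STARTED
logs-2019.05.26 1 p RELOCATING
"""

print(non_started_shards(SAMPLE))
```

An empty result corresponds to the "all shards are started" situation described here.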

As I mentioned above, I have done a full rolling restart of every node, as well as swapped out every node in the cluster.

Upgrade isn't on the cards right now unfortunately.

Tim_Curtin1 · May 26, 2019, 11:46pm

Looks like we are now able to DELETE the affected job (despite originally not being able to).

This is approximately 6 hours after the complete replacement of the nodes (the timing is likely irrelevant).

