I've done a fair bit of searching and can't come up with a fix here.
A curator snapshot in our cluster has become stuck (around 7TB of data).
About the cluster:
- ES 2.4.6
- Running on AWS with the cloud plugin
Here's what I have tried so far:
- the snapshot delete API (which fails, as expected for an in-progress snapshot)
- rolling restart of all nodes
- replaced all nodes with new nodes (data and master), no change
- checked for failed shards on the data nodes, there are none
I am now at a loss, and I am unable to snapshot the indices... Any help?
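For reference, this is roughly how I've been checking the snapshot's state (the endpoint and repository name below are placeholders; substitute your own):

```shell
# Placeholder endpoint and repository name -- substitute your own.
ES=localhost:9200
REPO=my_s3_repo

# List all snapshots in the repository with their state
# (SUCCESS, IN_PROGRESS, FAILED, ...)
curl -s "$ES/_snapshot/$REPO/_all?pretty"

# Per-shard progress of any currently running snapshot
curl -s "$ES/_snapshot/$REPO/_status?pretty"
```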
What version of Curator are you using with Elasticsearch 2.4.6?
Regardless, Curator only makes API calls. If there is a stall, it’s inside Elasticsearch. You might have to clear out a stalled snapshot using other API calls.
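Something along these lines, with placeholder repository and snapshot names:

```shell
# Placeholder names -- use your repository and the stuck snapshot's name.
# A DELETE on an in-progress snapshot tells Elasticsearch to abort it and
# clean up, rather than just removing an already-completed snapshot.
curl -XDELETE "localhost:9200/_snapshot/my_s3_repo/my_stuck_snapshot?pretty"
```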
We're running Curator 4.3.1; any thoughts on that? We understood it was not safe to use a later version.
Thanks, we're aware that Curator only makes the API calls.
At this point, we've exhausted all the API calls that are available:
- DELETE won't function while a snapshot is IN_PROGRESS
- we can't delete the repository while it is snapshotting to S3
- there are no active tasks or pending tasks
(and, as mentioned, we have tried restarts as well as complete node replacements)
The only thing left I can find any information on is a complete shutdown of the cluster, which, in an active cluster serving 3+ searches per second, certainly means downtime.
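For completeness, the task checks were along these lines (the endpoint is a placeholder):

```shell
ES=localhost:9200  # placeholder endpoint

# Cluster-level pending tasks (empty in our case)
curl -s "$ES/_cluster/pending_tasks?pretty"

# The task management API (available from ES 2.3 onwards) likewise
# showed nothing snapshot-related
curl -s "$ES/_tasks?pretty"
```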
DELETE is the way to stop an IN_PROGRESS snapshot. If that isn’t working, I can only surmise that something else is amiss in the cluster. A full or rolling restart is probably your next step. An upgrade to a newer release is also recommended as the 2.x series is no longer supported. Neither is the 5.x series.
Yes, we had thought so too, but nothing indicates that (_cluster/state shows everything is fine, no ABORTED shards, etc.).
_cat/shards also confirms all shards are started
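The specific checks, in case it helps anyone else (the endpoint is a placeholder):

```shell
ES=localhost:9200  # placeholder endpoint

# In-flight snapshots are recorded in the cluster state; filter the
# response down to just that section
curl -s "$ES/_cluster/state?filter_path=snapshots&pretty"

# Any shard not in STARTED state would show up here
# (state is the 4th column of the default _cat/shards output)
curl -s "$ES/_cat/shards" | awk '$4 != "STARTED"'
```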
As I mentioned above, I have done a full rolling restart of every node, and have also swapped out every node in the cluster.
Unfortunately, an upgrade isn't on the cards right now.
It looks like we are now able to DELETE the affected snapshot (despite originally not being able to).
This is approximately 6 hours after the complete replacement of the nodes (the timing is likely irrelevant).
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.