Snapshot getting stuck on a shard V2.3.1

surekhabalaji · April 12, 2017, 4:03pm

Elasticsearch version: 2.3.1

Plugins installed: []
lucene version 5.5.0

JVM version: open jdk 1.8.0

OS version: CentOS

Description of the problem including expected versus actual behavior:

I have set up a cluster with two Elasticsearch servers. The NFS repository is hosted on the first server (the master), and the NFS mount is set up on both servers. I created the snapshot repository and was able to take snapshots of individual small indices. When I try to take a snapshot of the entire set of indices, the snapshot process gets stuck on a couple of shards in one particular index. I had to stop the Elasticsearch servers and restart them to clear the stuck snapshot.
After restarting, I tried taking the snapshot again, and it got stuck on the same shards. It has been stuck for the past 15 hours.
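
For readers reproducing this setup, here is a minimal sketch of the two REST calls involved: registering a shared-filesystem (`fs`) repository and starting a snapshot. The host, the repository name `nfs_backup`, the mount path `/mnt/es_backups`, and the snapshot and index names are all hypothetical and not taken from the post. On Elasticsearch 2.x the location also has to be listed under `path.repo` in `elasticsearch.yml` on every node. The code only builds the requests; sending them is left as a comment.

```python
import json
from urllib.request import Request

ES_URL = "http://localhost:9200"  # assumption: address of any node in the cluster


def register_fs_repo(name, location):
    """Build the PUT request that registers a shared-filesystem repository."""
    body = {"type": "fs", "settings": {"location": location}}
    return Request(
        f"{ES_URL}/_snapshot/{name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def create_snapshot(repo, snapshot, indices=None):
    """Build the PUT request that starts a snapshot.

    With indices=None no body is sent, so all indices are included.
    """
    data = None
    if indices:
        data = json.dumps({"indices": ",".join(indices)}).encode("utf-8")
    return Request(
        f"{ES_URL}/_snapshot/{repo}/{snapshot}",
        data=data,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    # Hypothetical names: nfs_backup, /mnt/es_backups, small_index
    for req in (
        register_fs_repo("nfs_backup", "/mnt/es_backups"),
        create_snapshot("nfs_backup", "snapshot_small", ["small_index"]),
        create_snapshot("nfs_backup", "snapshot_all"),
    ):
        print(req.get_method(), req.full_url, req.data)
        # urllib.request.urlopen(req) would send it to a live cluster
```

The first `create_snapshot` call corresponds to the small per-index snapshots that worked, and the second to the full snapshot that got stuck.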

Please see the stuck shards '3' and '4' below:

"2": {
  "stage": "DONE",
  "stats": {
    "number_of_files": 207,
    "processed_files": 207,
    "total_size_in_bytes": 50471623283,
    "processed_size_in_bytes": 50471623283,
    "start_time_in_millis": 1491948418067,
    "time_in_millis": 2135013
  }
},
"3": {
  "stage": "STARTED",
  "stats": {
    "number_of_files": 204,
    "processed_files": 190,
    "total_size_in_bytes": 43726103499,
    "processed_size_in_bytes": 32915497923,
    "start_time_in_millis": 1491948498451,
    "time_in_millis": 0
  },
  "node": "ohMx7BUXRfyym0YaTlpreQ"
},
"4": {
  "stage": "STARTED",
  "stats": {
    "number_of_files": 211,
    "processed_files": 187,
    "total_size_in_bytes": 51847398788,
    "processed_size_in_bytes": 40983724276,
    "start_time_in_millis": 1491948421440,
    "time_in_millis": 0
  },
  "node": "inro3uspRw68FigfTxxu3Q"
}

abeyad · April 12, 2017, 4:09pm

@surekhabalaji do you have any relevant messages in the log files for the master node and/or the other node that you can share?

Also, have you tried restarting the NFS daemon on both machines to see if that resolves the problem?

Do the snapshots always get stuck at the same place during snapshotting? For example, for shard 3, does it always get stuck after processing 190 of the 204 files? Is it stuck there, or just making progress very slowly (e.g. moving to processing 191 of the 204 files after a long while)?
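
One way to tell "stuck" from "slow" is to take two readings of the snapshot status (`GET /_snapshot/<repo>/<snapshot>/_status`) a few minutes apart and compare the per-shard counters. The sketch below assumes each reading has already been parsed into a dict shaped like the fragment in the first post. The function names are made up for illustration, and treating any increase in `processed_files` or `processed_size_in_bytes` as progress is an assumption about how the counters behave, not something confirmed in this thread.

```python
def unfinished_shards(shards):
    """List (shard_id, node, files_left, percent_of_bytes_done) for shards not yet DONE."""
    rows = []
    for shard_id, info in sorted(shards.items()):
        if info["stage"] == "DONE":
            continue
        stats = info["stats"]
        files_left = stats["number_of_files"] - stats["processed_files"]
        total = stats["total_size_in_bytes"]
        pct = 100.0 * stats["processed_size_in_bytes"] / total if total else 0.0
        rows.append((shard_id, info.get("node"), files_left, round(pct, 1)))
    return rows


def classify(earlier, later):
    """Compare two readings of the same shard taken some minutes apart."""
    if later["stage"] == "DONE":
        return "done"
    before, after = earlier["stats"], later["stats"]
    moved = (
        after["processed_files"] > before["processed_files"]
        or after["processed_size_in_bytes"] > before["processed_size_in_bytes"]
    )
    return "slow" if moved else "stuck"


if __name__ == "__main__":
    # Values for shards 3 and 4 copied from the status output in the first post.
    reading = {
        "3": {
            "stage": "STARTED",
            "stats": {
                "number_of_files": 204,
                "processed_files": 190,
                "total_size_in_bytes": 43726103499,
                "processed_size_in_bytes": 32915497923,
            },
            "node": "ohMx7BUXRfyym0YaTlpreQ",
        },
        "4": {
            "stage": "STARTED",
            "stats": {
                "number_of_files": 211,
                "processed_files": 187,
                "total_size_in_bytes": 51847398788,
                "processed_size_in_bytes": 40983724276,
            },
            "node": "inro3uspRw68FigfTxxu3Q",
        },
    }
    for row in unfinished_shards(reading):
        print(row)
    # A real second reading would come from a later status call; here the
    # same reading is reused, so no counter has moved.
    print(classify(reading["3"], reading["3"]))
```

With the posted numbers, shard 3 has 14 files left and shard 4 has 24. If both counters are unchanged between readings, the shard is classified as stuck.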

surekhabalaji · April 12, 2017, 8:44pm

@abeyad, I could not find any relevant errors in the log files. The first time this issue happened, I tried restarting the NFS service, and that caused the Elasticsearch server to hang because snapshotting was still in progress. Since it was hung and I had to restart the node, I could not see where exactly it was stuck.

So now, before I restart the NFS daemon, is there a way to release the stuck snapshot?

abeyad · April 12, 2017, 9:01pm

If we aren't seeing any issues in Elasticsearch logs, my guess is the issues are inside NFS. What version of NFS are you on and what version of CentOS? Do you have any system logs that indicate an issue?

"So now, before I restart the NFS daemon, is there a way to release the stuck snapshot?"

Yes, delete the snapshot, which should remove it from the "stuck" state.
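
As a rough illustration, deleting a snapshot is a single `DELETE` call against the snapshot's URL, and per the Elasticsearch documentation, deleting a snapshot that is still running aborts it. The sketch below only builds the request. The host, repository name, and snapshot name are placeholders, since the thread does not give the real ones.

```python
from urllib.request import Request


def delete_snapshot_request(repo, snapshot, base_url="http://localhost:9200"):
    """Build the DELETE request that removes (or aborts) a snapshot."""
    return Request(f"{base_url}/_snapshot/{repo}/{snapshot}", method="DELETE")


if __name__ == "__main__":
    # Hypothetical repository and snapshot names.
    req = delete_snapshot_request("nfs_backup", "snapshot_all")
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req) would send it; the call may block until
    # the running shard snapshots have been aborted.
```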

surekhabalaji · April 12, 2017, 9:49pm

CentOS - Version 7 Linux
NFS - version 4

I don't see any errors in /var/log/messages.
I issued the snapshot 'delete' command and it has been running for an hour now.

abeyad · April 12, 2017, 10:01pm

Strange. Is it possible to share your logs from the two nodes in the cluster with me? You can email them to me; my email is my first name at elastic.co

surekhabalaji · April 19, 2017, 2:03pm

The issue got resolved after we restarted the nodes and restarted the Elasticsearch servers. Thanks for the help. We can close this discussion.

system · May 17, 2017, 2:10pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
