Snapshot getting stuck on a shard (v2.3.1)

Elasticsearch version: 2.3.1

Plugins installed: []
Lucene version: 5.5.0

JVM version: OpenJDK 1.8.0

OS version: CentOS 7

Description of the problem including expected versus actual behavior:

I have set up a cluster with two Elasticsearch servers, with the NFS repository on the first server (the master) and the NFS mount configured on both servers. I created the snapshot repository and was able to take snapshots of individual small indices. When I try to take a snapshot of the entire set of indices, the snapshot process gets stuck on a couple of shards in one particular index. I had to stop the Elasticsearch servers and restart them to clear the stuck snapshot.
After restarting, when I tried taking the snapshot again, it got stuck on the same shards. It has been stuck for the past 15 hours.
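For reference, the repository registration and the full snapshot request looked roughly like the following (the repository name, mount path, and snapshot name are placeholders rather than my actual values; the mount path has to be whitelisted in path.repo on every node):

    # register the shared filesystem repository on the NFS mount
    curl -XPUT 'http://localhost:9200/_snapshot/nfs_repo' -d '{
      "type": "fs",
      "settings": {
        "location": "/mnt/nfs/es_backups",
        "compress": true
      }
    }'

    # snapshot all indices (runs in the background by default)
    curl -XPUT 'http://localhost:9200/_snapshot/nfs_repo/full_snapshot_1'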

Please see below for the stuck shards '3' and '4':
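The output below comes from the snapshot status API, retrieved with something like this (same placeholder names as above):

    curl -XGET 'http://localhost:9200/_snapshot/nfs_repo/full_snapshot_1/_status'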

"2": {
"stage": "DONE",
"stats": {
"number_of_files": 207,
"processed_files": 207,
"total_size_in_bytes": 50471623283,
"processed_size_in_bytes": 50471623283,
"start_time_in_millis": 1491948418067,
"time_in_millis": 2135013
}
},
"3": {
"stage": "STARTED",
"stats": {
"number_of_files": 204,
"processed_files": 190,
"total_size_in_bytes": 43726103499,
"processed_size_in_bytes": 32915497923,
"start_time_in_millis": 1491948498451,
"time_in_millis": 0
},
"node": "ohMx7BUXRfyym0YaTlpreQ"
},
"4": {
"stage": "STARTED",
"stats": {
"number_of_files": 211,
"processed_files": 187,
"total_size_in_bytes": 51847398788,
"processed_size_in_bytes": 40983724276,
"start_time_in_millis": 1491948421440,
"time_in_millis": 0
},
"node": "inro3uspRw68FigfTxxu3Q"


@surekhabalaji do you have any relevant messages in the log files for the master node and/or the other node that you can share?

Also, have you tried restarting the NFS daemon on both machines to see if that resolves the problem?
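On CentOS 7 that would typically be something along these lines (assuming the stock nfs-server unit on the exporting machine; the mount point is a placeholder):

    # on the server exporting the NFS share
    sudo systemctl restart nfs-server
    # then re-mount the share on each Elasticsearch node
    sudo umount /mnt/nfs/es_backups && sudo mount /mnt/nfs/es_backups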

Do the snapshots always get stuck at the same place during snapshotting? For example, for shard 3, does it always get stuck after processing 190 of the 204 files? Is it stuck there, or just making progress very slowly (e.g. moving on to processing 191 of the 204 files after a long while)?

@abeyad, I could not find any relevant errors in the log files. When I tried restarting the NFS service the first time this issue happened, it caused the Elasticsearch server to hang because snapshotting was still in progress. As it was hung and I had to restart the node, I could not see where exactly it was stuck.

So now, before I restart the NFS daemon, is there a way to release the stuck snapshot?

If we aren't seeing any issues in the Elasticsearch logs, my guess is the issue is inside NFS. What version of NFS are you on, and what version of CentOS? Do you have any system logs that indicate an issue?

So now, before I restart the NFS daemon, is there a way to release the stuck snapshot?

Yes, delete the snapshot, which should remove it from the "stuck" state.
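For example, using the same placeholder repository and snapshot names as above:

    curl -XDELETE 'http://localhost:9200/_snapshot/nfs_repo/full_snapshot_1'

Deleting a snapshot that is still in progress tells Elasticsearch to abort it and clean up its partial state in the repository.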

CentOS: version 7
NFS: version 4
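For the record, the protocol version actually negotiated for the mount can be confirmed on both nodes with something like:

    # shows the mount options, including vers=
    nfsstat -m
    # or
    mount | grep nfs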

I don't see any errors in /var/log/messages.
I issued the delete snapshot command and it has been running for an hour now.

Strange. Is it possible to share your logs from the two nodes in the cluster with me? You can email them to me; my email is my first name at elastic.co.

The issue got resolved after we restarted the nodes and the Elasticsearch servers. Thanks for the help. We can close this discussion.
