Timeout on deleting a snapshot repository

I'm running Elasticsearch 1.7.5 with 19 nodes (12 data nodes).

I'm trying to set up snapshots for backup and recovery, but I get a 503 when creating or deleting a snapshot repository.

curl -XDELETE 'localhost:9200/_snapshot/backups?pretty'

returns:

{
 "error" : "RemoteTransportException[[masternodename][inet[/10.0.0.20:9300]][cluster:admin/repository/delete]]; nested: ProcessClusterEventTimeoutException[failed to process cluster event (delete_repository [backups]) within 30s]; ",
 "status" : 503
}

I tried raising the limit by adding master_timeout=10m to the request, but it still times out. Is there a way to debug why this request is failing?
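For reference, a sketch of what the request with the longer timeout looks like, assuming the same host and repository name as in the request above. The variable names are only for illustration, and the curl call is left commented out so the snippet can be run without touching a cluster.

```shell
#!/bin/sh
# Sketch: build the delete-repository request with a longer master timeout.
# ES_HOST, REPO and MASTER_TIMEOUT are illustrative names, not part of any API.
ES_HOST="${ES_HOST:-localhost:9200}"
REPO="backups"
MASTER_TIMEOUT="10m"

# master_timeout controls how long the request waits for the master node
# to process the cluster state update.
url="${ES_HOST}/_snapshot/${REPO}?master_timeout=${MASTER_TIMEOUT}&pretty"
echo "DELETE ${url}"

# Uncomment to send the request:
# curl -XDELETE "${url}"
```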

Thanks.

Do you see a lot of pending tasks on your master node when it happens? If you do, how many tasks are there and what's their most common source?
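One way to check, as a rough sketch: the pending tasks API (_cluster/pending_tasks) lists the cluster state updates queued on the master. The helper below is hypothetical. It assumes the pretty-printed response, with one "source" field per line, and strips everything from the first bracket or parenthesis so that similar tasks are counted together.

```shell
#!/bin/sh
# Hypothetical helper: count pending cluster tasks by source.
# Reads the pretty-printed pending tasks JSON on stdin (assumes ?pretty,
# i.e. one field per line) and prints "count source", most frequent first.
summarize_pending_sources() {
  grep '"source"' |
    sed -e 's/^[^:]*:[[:space:]]*"//' \
        -e 's/",\{0,1\}[[:space:]]*$//' \
        -e 's/[[:space:]]*[[(].*$//' |
    sort | uniq -c | sort -rn
}

# Usage against a live cluster (host taken from the question):
# curl -s 'localhost:9200/_cluster/pending_tasks?pretty' | summarize_pending_sources
```

The number of lines in the raw response with a "source" field gives the total task count; the summary shows which kind of task dominates the queue.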

Originally the cluster had ~10 running tasks (with a higher priority than the put/delete repository task). When I tried the action again today with 0 running tasks, it ran without delay.

I'll keep monitoring, but that may have been the issue.

Thanks.

Yes, I typically see this on clusters with a large cluster state, when the cluster state update tasks queued ahead of the delete repository task take too long to finish (because of a large number of nodes, indices, types, aliases, etc.). We have improved the situation in later versions of Elasticsearch by shipping cluster state diffs instead of the complete cluster state and by batching cluster state tasks in more places.
