I use curator to take the backups. The snapshots themselves complete successfully, but it then fails when it tries to delete old snapshots, because that step requires the snapshot list too:
2017-07-25 11:53:02,191 ERROR Failed to complete action: delete_snapshots. <class 'curator.exceptions.FailedExecution'>: Unable to get snapshot information from repository: long_term. Error: TransportError(500, 'null_pointer_exception', '[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]')
I have a feeling this is due to some kind of timeout. I turned on debug logging, and while I couldn't find a more specific reason for the failure, I noticed it made about 2,000 requests to S3 before it failed, and it stopped at 90 seconds. Is this a configurable timeout?
Nope, it started a few days earlier, so it began while I was still on 5.4.1. Curling for the list of snapshots had been taking increasingly long, but it still worked. Eventually I hit a curator timeout, which I increased, and it was working again. Now it seems to break before it hits that timeout.
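
In case it helps narrow this down, here is a rough sketch of how the snapshot listing could be timed outside of curator, to see whether it dies at a fixed point (such as the 90 seconds above) regardless of the client-side timeout. It assumes an elasticsearch-py style client object whose `snapshot.get()` accepts a per-request `request_timeout`; the function name `time_snapshot_listing` and the 300-second default are made up for illustration.

```python
import time


def time_snapshot_listing(client, repository, request_timeout=300):
    """Return (seconds_elapsed, snapshot_count_or_None, error_or_None).

    `client` is assumed to behave like an elasticsearch-py client;
    the helper name and default timeout are illustrative only.
    """
    start = time.monotonic()
    try:
        # snapshot="_all" asks for every snapshot in the repository,
        # which is the same listing curator needs before deleting.
        response = client.snapshot.get(
            repository=repository,
            snapshot="_all",
            request_timeout=request_timeout,
        )
    except Exception as exc:  # e.g. a TransportError(500, ...) or a timeout
        return time.monotonic() - start, None, exc
    return time.monotonic() - start, len(response.get("snapshots", [])), None
```

Calling it with a real client and the `long_term` repository while varying `request_timeout` should show whether the failure time tracks the client setting. If the elapsed time stays the same no matter what timeout is passed, the limit is probably not on the client side.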