NullPointerException when getting list of snapshots on S3


#1

Hello,

I'm using the repository-s3 plugin and after using it for some months I now cannot get the list of snapshots:

curl elasticsearch:9200/_snapshot/long_term/_all
{"error":{"root_cause":[{"type":"remote_transport_exception","reason":"[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]"}],"type":"null_pointer_exception","reason":null},"status":500}

Elasticsearch logs this:

[2017-07-25T12:01:47,038][WARN ][r.suppressed             ] path: /_snapshot/long_term/_all, params: {repository=long_term, snapshot=_all}
org.elasticsearch.transport.RemoteTransportException: [SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]
Caused by: java.lang.NullPointerException

I use curator to take the backups and after grabbing backups successfully it fails when it tries to delete old snapshots because that's when it requires a list too:

2017-07-25 11:53:02,191 ERROR     Failed to complete action: delete_snapshots.  <class 'curator.exceptions.FailedExecution'>: Unable to get snapshot information from repository: long_term.  Error: TransportError(500, 'null_pointer_exception', '[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]')

I have a feeling this is due to some kind of timeout. I turned on debug logging and while I couldn't find a more specific reason this fails I noticed it made ~ 2K requests to S3 until it failed and it stopped at 90 seconds. Is this a configurable timeout?

Please help! I got no access to my backups!


(David Pilato) #2

That sounds very bad.

I think you should open an issue in elasticsearch.
Can you tell more about your version?


#3

Just upgraded to 5.5.0.


(David Pilato) #4

And it appeared just after the upgrade?


#5

Nope it started a few days before so it started while on 5.4.1.curling for the list of snapshots took increasingly long in the past but it still worked. Eventually I reached a curator timeout which I increased and it was working again. Now it seems to be breaking before it hits that timeout.


(David Pilato) #6

I see. AFAIK we don't set any 90s timeout when we are using the AWS SDK and defaults don't seem to set that as per http://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/section-client-configuration.html.

We are just setting the setSocketTimeout which is by default 50s.

And AFAIK it only set the timeout for a single API call.

May be it's a more global timeout that we have in snapshot/restore? @Igor_Motov Do you have an idea?


#7

Opened an issue too: https://github.com/elastic/elasticsearch/issues/25878


(Igor Motov) #8

Can you check logs on 10.127.1.203 around 2017-07-25T12:01:47 to see if there are any other messages there?


#9

Added some more logs to the issue https://github.com/elastic/elasticsearch/issues/25878#issuecomment-317748498. Regarding the time I am providing logs from a request initiated at 2017-07-25T12:26:30 because that's when I had DEBUG on.


(system) #10

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.