The error occurs when the request's timestamp is more than 15 minutes
earlier than the time the request is actually received by S3.
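(For context: S3 rejects a request when the Date header it was signed with is more than 15 minutes away from S3's own clock. A minimal Java sketch of that rule, not AWS's actual code:)

import java.util.Date;
import java.util.concurrent.TimeUnit;

// Sketch of the RequestTimeTooSkewed rule: the request carries the Date it was
// signed with, and S3 rejects it when that Date is more than 15 minutes away
// from S3's clock at the moment the request arrives.
public class SkewCheck {
    static boolean requestTimeTooSkewed(Date signedAt, Date serverNow) {
        long skewMs = Math.abs(serverNow.getTime() - signedAt.getTime());
        return skewMs > TimeUnit.MINUTES.toMillis(15);
    }

    public static void main(String[] args) {
        Date signedAt = new Date(System.currentTimeMillis() - TimeUnit.MINUTES.toMillis(16));
        System.out.println(requestTimeTooSkewed(signedAt, new Date())); // true -> 403
    }
}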
Here's what I think is happening:
We start the snapshot at 23:17:
[23:17:42,020][DEBUG][index.gateway ] [Blackheart] [tweet-2009q4][8] snapshot (api) to s3://infochimps-search/[hoolock][indices][tweet-2009q4][8] ...
ES builds out many dozens of requests, enough to fill the pool of
available connections:
[23:17:43,350][DEBUG][com.amazonaws.services.s3.internal.S3Signer] Calculated string to sign:"PUT
application/octet-stream
Mon, 10 Jan 2011 23:17:43 GMT
/infochimps-search/hoolock%2Findices%2Ftweet-2009q4%2F8%2F__2o.part72"
[23:17:43,350][DEBUG][org.apache.commons.httpclient.MultiThreadedHttpConnectionManager] HttpConnectionManager.getConnection: config = HostConfiguration[host=http://infochimps-search.s3.amazonaws.com], timeout = 0
[23:17:43,350][DEBUG][org.apache.commons.httpclient.params.DefaultHttpParams] Set parameter http.protocol.expect-continue = true
[23:17:43,350][DEBUG][com.amazonaws.http.HttpClient] Not overwriting Content-Type; already set to: Content-Type: application/octet-stream
[23:17:43,350][DEBUG][org.apache.commons.httpclient.MultiThreadedHttpConnectionManager] Unable to get a connection, waiting..., hostConfig=HostConfiguration[host=http://infochimps-search.s3.amazonaws.com]
All the requests ahead of it start running; there are enough of them
that some don't get a connection within 15 minutes.
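If I'm reading commons-httpclient right, the pool behaves roughly like the sketch below; the pool sizes are purely illustrative, since I haven't checked what the AWS SDK actually configures:

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;

// Sketch of the pool behaviour in the log above: the manager hands out at most
// N connections per host; once they are all checked out, further getConnection()
// calls block ("Unable to get a connection, waiting...") until one is freed.
public class PoolSketch {
    public static void main(String[] args) {
        MultiThreadedHttpConnectionManager manager = new MultiThreadedHttpConnectionManager();
        manager.getParams().setDefaultMaxConnectionsPerHost(50); // illustrative value only
        manager.getParams().setMaxTotalConnections(50);          // illustrative value only
        HttpClient client = new HttpClient(manager);
        // Each part upload is signed (its Date header fixed) when the request is
        // built, but it only hits the wire once it gets a connection from this
        // pool, so the tail of a long queue can be signed 15+ minutes before it
        // is actually sent.
    }
}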
Right at the 15-minute mark, the requests being popped off the wait
queue start failing. Here's the last one that worked and the first of
the dozen or so that failed:
[23:32:46,379][DEBUG][org.apache.commons.httpclient.MultiThreadedHttpConnectionManager] Getting free connection, hostConfig=HostConfiguration[host=http://infochimps-search.s3.amazonaws.com]
[23:32:46,387][DEBUG][org.apache.commons.httpclient.HttpMethodBase] Adding Host request header
[23:32:46,393][DEBUG][org.apache.commons.httpclient.HttpMethodBase] Should close connection in response to directive: close
[23:32:46,393][DEBUG][org.apache.commons.httpclient.HttpConnection] Releasing connection back to connection manager.
[23:32:46,393][DEBUG][org.apache.commons.httpclient.MultiThreadedHttpConnectionManager] Freeing connection, hostConfig=HostConfiguration[host=http://infochimps-search.s3.amazonaws.com]
[23:32:46,393][DEBUG][org.apache.commons.httpclient.util.IdleConnectionHandler] Adding connection at: 1294702366393
[23:32:46,393][DEBUG][org.apache.commons.httpclient.MultiThreadedHttpConnectionManager] Notifying thread waiting on host pool, hostConfig=HostConfiguration[host=http://infochimps-search.s3.amazonaws.com]
[23:32:46,394][DEBUG][org.apache.commons.httpclient.MultiThreadedHttpConnectionManager] Getting free connection, hostConfig=HostConfiguration[host=http://infochimps-search.s3.amazonaws.com]
[23:32:46,394][DEBUG][org.apache.commons.httpclient.HttpConnection] Open connection to infochimps-search.s3.amazonaws.com:80
[23:32:46,398][INFO ][com.amazonaws.request ] Received error response: Status Code: 403, AWS Request ID: XXXXXX, AWS Error Code: RequestTimeTooSkewed, AWS Error Message: The difference between the request time and the current time is too large., S3 Extended Request ID: XXXXXX
===================================================
Takeaways:
== 90 percent of the files (taking about 15 minutes to upload)
succeeded, yet ES tried to snapshot all of them. If ES compared the
MD5 hash of the file already in the bucket (an inexpensive AWS API call)
against the shard segment it was about to send, this could be avoided
(see the ETag sketch below).
== Have each HTTP request calculate its signature only when a
connection actually becomes available, not when it is queued.
== This is so close to working (it succeeded on 12 of 16 shards) that
I'm going to try again at 3am and see if it can run 10% faster. Apart
from that, is there some figure I can munge to increase the size of
the HTTP connection pool so that they all start simultaneously? (See
the ClientConfiguration sketch below.)
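For the first point, something like the sketch below is what I have in mind. It's only a sketch: alreadyUploaded, the bucket/key arguments, and the local MD5 computation are all made up, and the ETag-equals-MD5 shortcut only holds for plain (non-multipart) PUTs.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.AmazonS3Exception;
import com.amazonaws.services.s3.model.ObjectMetadata;

// Hypothetical pre-upload check: issue a cheap HEAD request and compare the
// object's ETag (the MD5 of a non-multipart PUT) with the local file's MD5.
public class SkipIfUnchanged {
    static boolean alreadyUploaded(AmazonS3 s3, String bucket, String key, String localMd5Hex) {
        try {
            ObjectMetadata meta = s3.getObjectMetadata(bucket, key); // HEAD, no data transfer
            return localMd5Hex.equalsIgnoreCase(meta.getETag());
        } catch (AmazonS3Exception e) {
            if (e.getStatusCode() == 404) {
                return false; // object not there yet, so upload it
            }
            throw e;
        }
    }
}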
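For the last question: at the AWS SDK level the knob seems to be ClientConfiguration.setMaxConnections(); I don't know whether the ES s3 gateway exposes it as a setting, so the sketch below only shows what the SDK itself accepts (the credentials and the value 200 are placeholders).

import com.amazonaws.ClientConfiguration;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3Client;

// Sketch of enlarging the SDK's HTTP connection pool; ES would have to pass a
// configuration like this through to its S3 client for it to affect snapshots.
public class BiggerPool {
    public static void main(String[] args) {
        ClientConfiguration config = new ClientConfiguration();
        config.setMaxConnections(200); // placeholder value
        AmazonS3Client s3 = new AmazonS3Client(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"), config);
    }
}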
flip
On Jan 10, 1:50 pm, jacob jacob.a.perk...@gmail.com wrote:
I've got a very large index (~450GB) that I'm in the process of
snapshotting to the s3 gateway. After issuing the snapshot command via
the REST API, there's tons of network traffic as shards are copied to
S3, until the following:
{"ok":true,"_shards":{"total":16,"successful":12,"failed":4,"failures":[{"i ndex":"tweet-2009q4","shard":10,"reason":"BroadcastShardOperationFailedExce ption[[tweet-2009q4]10] ]; nested: RemoteTransportException[[Corsair][inet[/10.117.110.126:9300]][indices/gate way/snapshot/shard]]; nested: IndexShardGatewaySnapshotFailedException[[tweet-2009q4]10] Failed to perform snapshot (index files)]; nested: AmazonS3Exception[The difference between the request time and the current time is too large.]; "},{"index":"tweet-2009q4","shard":8,"reason":"BroadcastShardOperationFaile dException[[tweet-2009q4]8] ]; nested: RemoteTransportException[[Blackheart][inet[/10.117.31.211:9300]][indices/ga teway/snapshot/shard]]; nested: IndexShardGatewaySnapshotFailedException[[tweet-2009q4]8] Failed to perform snapshot (index files)]; nested: AmazonS3Exception[The difference between the request time and the current time is too large.]; "},{"index":"tweet-2009q4","shard":9,"reason":"BroadcastShardOperationFaile dException[[tweet-2009q4]9] ]; nested: RemoteTransportException[[Silverclaw][inet[/10.102.41.43:9300]][indices/gat eway/snapshot/shard]]; nested: IndexShardGatewaySnapshotFailedException[[tweet-2009q4]9] Failed to perform snapshot (index files)]; nested: AmazonS3Exception[The difference between the request time and the current time is too large.]; "},{"index":"tweet-2009q4","shard":14,"reason":"BroadcastShardOperationFail edException[[tweet-2009q4]14] ]; nested: RemoteTransportException[[Noble, Peter][inet[/10.218.55.169:9300]][indices/gateway/snapshot/shard]]; nested: IndexShardGatewaySnapshotFailedException[[tweet-2009q4]14] Failed to perform snapshot (index files)]; nested: AmazonS3Exception[The difference between the request time and the current time is too large.]; "}]}}
Now, this has happened a few times: first it was 7 failed shards, then
6, then 5, and now 4 as I rerun the same command. Any ideas as to what
might be causing this?
--jacob