I'm from the same company as Martin_Brennan.
We experienced this problem again today on multiple clusters. Our search clusters are quite large (40 data nodes) and under heavy ingestion and query load, so performing a rolling restart of all nodes is not something we are keen to do. So far we have just been restarting the nodes that exhibit this timeout issue during snapshot creation.
For example, today we saw the following error while snapshotting:
{
"duration_in_millis": 345001,
"end_time": "2019-02-02T00:05:48.480Z",
"end_time_in_millis": 1549065948480,
"failures": [
{
"index": "redacted",
"index_uuid": "redacted",
"node_id": "lxtigl9JRvm1dLX0RurmUg",
"reason": "IndexShardSnapshotFailedException[com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SdkClientException[Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: ConnectTimeoutException[Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SocketTimeoutException[connect timed out]; ",
"shard_id": 11,
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "redacted",
"index_uuid": "redacted",
"node_id": "lxtigl9JRvm1dLX0RurmUg",
"reason": "IndexShardSnapshotFailedException[com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SdkClientException[Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: ConnectTimeoutException[Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SocketTimeoutException[connect timed out]; ",
"shard_id": 13,
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "redacted",
"index_uuid": "redacted",
"node_id": "lxtigl9JRvm1dLX0RurmUg",
"reason": "IndexShardSnapshotFailedException[com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SdkClientException[Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: ConnectTimeoutException[Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SocketTimeoutException[connect timed out]; ",
"shard_id": 11,
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "redacted",
"index_uuid": "redacted",
"node_id": "lxtigl9JRvm1dLX0RurmUg",
"reason": "IndexShardSnapshotFailedException[com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SdkClientException[Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: ConnectTimeoutException[Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SocketTimeoutException[connect timed out]; ",
"shard_id": 14,
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "redacted",
"index_uuid": "redacted",
"node_id": "lxtigl9JRvm1dLX0RurmUg",
"reason": "IndexShardSnapshotFailedException[com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SdkClientException[Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: ConnectTimeoutException[Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SocketTimeoutException[connect timed out]; ",
"shard_id": 12,
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "redacted",
"index_uuid": "redacted",
"node_id": "lxtigl9JRvm1dLX0RurmUg",
"reason": "IndexShardSnapshotFailedException[com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SdkClientException[Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: ConnectTimeoutException[Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SocketTimeoutException[connect timed out]; ",
"shard_id": 19,
"status": "INTERNAL_SERVER_ERROR"
}
],
"include_global_state": true,
"indices": [
"redacted",
"redacted",
"redacted",
"redacted",
"redacted",
"redacted",
"redacted",
"redacted",
"redacted",
"redacted"
],
"shards": {
"failed": 6,
"successful": 194,
"total": 200
},
"snapshot": "redacted_2019-02-02t00:00:03z",
"start_time": "2019-02-02T00:00:03.479Z",
"start_time_in_millis": 1549065603479,
"state": "PARTIAL",
"uuid": "mbCeCSWuTXWkauPLgNq3Hg",
"version": "6.3.0",
"version_id": 6030099
}
Note: The 6 shards that failed to snapshot were all on the same host, lxtigl9JRvm1dLX0RurmUg. Every subsequent attempt to create a snapshot results in the exact same error, with only this single node failing with a timeout to S3.
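To confirm quickly that the failures are confined to one node, the `failures` array of the snapshot response can be grouped by `node_id`. This is only a minimal sketch: the function name `failed_shards_by_node` and the trimmed-down `sample` dict are made up for illustration, and it assumes the response has already been parsed from JSON.

```python
from collections import defaultdict


def failed_shards_by_node(snapshot_info):
    """Group failed shard IDs by the node that reported the failure."""
    by_node = defaultdict(list)
    for failure in snapshot_info.get("failures", []):
        by_node[failure["node_id"]].append(failure["shard_id"])
    return dict(by_node)


# Hypothetical, trimmed-down sample with the same shape as the response above.
sample = {
    "state": "PARTIAL",
    "failures": [
        {"node_id": "lxtigl9JRvm1dLX0RurmUg", "shard_id": 11},
        {"node_id": "lxtigl9JRvm1dLX0RurmUg", "shard_id": 13},
    ],
}

result = failed_shards_by_node(sample)
print(result)
```

A result with a single key means every failed shard was reported by the same node, which is what we see each time.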
Restarting the Elasticsearch process on this host, waiting for the cluster to go green, and then snapshotting again is successful.
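The steps after the restart can be scripted roughly as follows. This is a sketch, not what we actually run: `retry_requests` and `run` are hypothetical helper names, the base URL, repository, and snapshot names are placeholders, and the 10 minute health timeout is an arbitrary choice. It uses the cluster health API's `wait_for_status` parameter and the create snapshot API's `wait_for_completion` parameter.

```python
import json
import urllib.request


def retry_requests(base_url, repository, snapshot):
    """Return the (method, url) pairs to issue once the node has been restarted."""
    return [
        # Block until the cluster reports green, or the timeout expires.
        ("GET", f"{base_url}/_cluster/health?wait_for_status=green&timeout=10m"),
        # Create the snapshot and wait for it to finish.
        ("PUT", f"{base_url}/_snapshot/{repository}/{snapshot}?wait_for_completion=true"),
    ]


def run(base_url, repository, snapshot):
    """Send the requests in order and return the last response body."""
    for method, url in retry_requests(base_url, repository, snapshot):
        req = urllib.request.Request(url, method=method)
        with urllib.request.urlopen(req) as resp:
            body = json.load(resp)
        # The health API sets "timed_out" when green was not reached in time.
        if body.get("timed_out"):
            raise RuntimeError("cluster did not reach green in time")
    return body
```

Calling `run("http://localhost:9200", "my_repo", "snap_1")` against a real cluster would return the snapshot response, whose `state` field can then be checked for `SUCCESS` versus `PARTIAL`.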
This is a pretty serious problem for us: we need to be able to reliably take snapshots every 24 hours. If there is more information we can provide to help get this triaged, please let us know.