S3 snapshotting timed out requiring Elasticsearch process restart - 6.3.0

We ran into an issue where we were unable to read or write snapshots to s3. The error is listed below. We mitigated the issue by restarting the Elasticsearch process on the master node.

On an independent cluster we also had a similar timeout problem but this time it was reported by the data nodes and restarting the data nodes fixed the problem on this cluster.

curl -s localhost:9200/_cat/snapshots/repo_name
{
  "error": {
    "caused_by": {
      "caused_by": {
        "reason": "connect timed out",
        "type": "i_o_exception"
      },
      "reason": "Connect to repo_name.s3.amazonaws.com:443 [repo_name.s3.amazonaws.com/52.217.0.51] failed: connect timed out",
      "type": "i_o_exception"
    },
    "reason": "sdk_client_exception: Unable to execute HTTP request: Connect to intersearch-snapshots-usersearch-v11-0.s3.amazonaws.com:443 [repo_name.s3.amazonaws.com/52.217.0.51] failed: connect timed out",
    "root_cause": [
      {
        "reason": "sdk_client_exception: Unable to execute HTTP request: Connect to repo_name.s3.amazonaws.com:443 [repo_name.s3.amazonaws.com/52.217.0.51] failed: connect timed out",
        "type": "sdk_client_exception"
      }
    ],
    "type": "sdk_client_exception"
  },
  "status": 500
}

Restarting the elasticsearch process with /etc/init.d/elasticsearch restart fixed the issue.

curl -s localhost:9200/_cat/snapshots/repo_name
repo_name_2018-12-26t04:00:06z SUCCESS 1545796807 04:00:07 1545796837 04:00:37 30.6s 4 40 0 40

The clusters this happened on were running Elasticsearch version 6.3.0

Our hypothesis is that there was a cached DNS record somewhere in the java process and when something on the AWS side changed this cached record was no longer valid.

This issue might be related - Cannot list snapshots in S3 repository (previously working fine)

I'm from this same company as Martin_Brennan.

We just experienced this problem again today on multiple clusters. Our search clusters are quite large (40 data-nodes) and under a lot of ingestion and query load so preforming a rolling restart of all nodes is not something we are keen to do. So far we have just be restarting the nodes that have exhibited this timeout issue during snapshot creation.

For example, today we saw the following error during snapshotting

{
  "duration_in_millis": 345001,
  "end_time": "2019-02-02T00:05:48.480Z",
  "end_time_in_millis": 1549065948480,
  "failures": [
    {
      "index": "redacted",
      "index_uuid": "redacted",
      "node_id": "lxtigl9JRvm1dLX0RurmUg",
      "reason": "IndexShardSnapshotFailedException[com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SdkClientException[Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: ConnectTimeoutException[Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SocketTimeoutException[connect timed out]; ",
      "shard_id": 11,
      "status": "INTERNAL_SERVER_ERROR"
    },
    {
      "index": "redacted",
      "index_uuid": "redacted",
      "node_id": "lxtigl9JRvm1dLX0RurmUg",
      "reason": "IndexShardSnapshotFailedException[com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SdkClientException[Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: ConnectTimeoutException[Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SocketTimeoutException[connect timed out]; ",
      "shard_id": 13,
      "status": "INTERNAL_SERVER_ERROR"
    },
    {
      "index": "redacted",
      "index_uuid": "redacted",
      "node_id": "lxtigl9JRvm1dLX0RurmUg",
      "reason": "IndexShardSnapshotFailedException[com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SdkClientException[Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: ConnectTimeoutException[Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SocketTimeoutException[connect timed out]; ",
      "shard_id": 11,
      "status": "INTERNAL_SERVER_ERROR"
    },
    {
      "index": "redacted",
      "index_uuid": "redacted",
      "node_id": "lxtigl9JRvm1dLX0RurmUg",
      "reason": "IndexShardSnapshotFailedException[com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SdkClientException[Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: ConnectTimeoutException[Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SocketTimeoutException[connect timed out]; ",
      "shard_id": 14,
      "status": "INTERNAL_SERVER_ERROR"
    },
    {
      "index": "redacted",
      "index_uuid": "redacted",
      "node_id": "lxtigl9JRvm1dLX0RurmUg",
      "reason": "IndexShardSnapshotFailedException[com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SdkClientException[Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: ConnectTimeoutException[Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SocketTimeoutException[connect timed out]; ",
      "shard_id": 12,
      "status": "INTERNAL_SERVER_ERROR"
    },
    {
      "index": "redacted",
      "index_uuid": "redacted",
      "node_id": "lxtigl9JRvm1dLX0RurmUg",
      "reason": "IndexShardSnapshotFailedException[com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SdkClientException[Unable to execute HTTP request: Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: ConnectTimeoutException[Connect to redacted.s3.amazonaws.com:443 [redacted.s3.amazonaws.com/52.216.176.27] failed: connect timed out]; nested: SocketTimeoutException[connect timed out]; ",
      "shard_id": 19,
      "status": "INTERNAL_SERVER_ERROR"
    }
  ],
  "include_global_state": true,
  "indices": [
    "redacted",
    "redacted",
    "redacted",
    "redacted",
    "redacted",
    "redacted",
    "redacted",
    "redacted",
    "redacted",
    "redacted"
  ],
  "shards": {
    "failed": 6,
    "successful": 194,
    "total": 200
  },
  "snapshot": "redacted_2019-02-02t00:00:03z",
  "start_time": "2019-02-02T00:00:03.479Z",
  "start_time_in_millis": 1549065603479,
  "state": "PARTIAL",
  "uuid": "mbCeCSWuTXWkauPLgNq3Hg",
  "version": "6.3.0",
  "version_id": 6030099
}

Note: The 6 shards that failed to snapshot were all on the same host lxtigl9JRvm1dLX0RurmUg. Every subsequent attempt to create a snapshot results in the exact same error, with only this single node failing with a timeout to S3.

Restarting the Elasticsearch process on this host, waiting for the cluster to go green, and then snapshotting again is successful.

This is pretty serious problem for us, we need to be able to reliably take snapshots every 24 hours. If there is more information you need us to provide in order to get this triaged, please let us know.

The Java virtual machine (JVM) caches DNS name lookups. When the JVM resolves a hostname to an IP address, it caches the IP address for a specified period of time, known as the time-to-live (TTL).

Because AWS resources use DNS name entries that occasionally change, we recommend that you configure your JVM with a TTL value of no more than 60 seconds. This ensures that when a resource's IP address changes, your application will be able to receive and use the resource's new IP address by requerying the DNS.

On some Java configurations, the JVM default TTL is set so that it will never refresh DNS entries until the JVM is restarted. Thus, if the IP address for an AWS resource changes while your application is still running, it won't be able to use that resource until you manually restart the JVM and the cached IP information is refreshed. In this case, it's crucial to set the JVM's TTL so that it will periodically refresh its cached IP information.

This ^ seems likely to be the root cause:
https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/java-dg-jvm-ttl.html

Should you guys be setting this TTL somewhere in this repo? https://github.com/elastic/elasticsearch/tree/master/plugins/repository-s3

Just in case it's useful, according to the Elasticsearch log, we are NOT passing any JVM arg that could be overriding any ttl defaults.

[2019-02-02T14:33:52,392][INFO ][o.e.n.Node ] [es-master-3.localdomain] version[6.3.0], pid[27981], build[default/rpm/424e937/2018-06-11T23:38:03.357887Z], OS[Linux/4.14.88-72.76.amzn1.x86_64/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_171/25.171-b11]

[2019-02-02T14:33:52,392][INFO ][o.e.n.Node ] [es-master-3.localdomain] JVM arguments [-Xms7789m, -Xmx7789m, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, -XX:+PrintTenuringDistribution, -XX:+PrintGCApplicationStoppedTime, -Xloggc:logs/gc.log, -XX:+UseGCLogFileRotation, -XX:NumberOfGCLogFiles=32, -XX:GCLogFileSize=64m, -Des.index.memory.max_index_buffer_size=10240mb, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/etc/elasticsearch, -Des.distribution.flavor=default, -Des.distribution.type=rpm]

I am experiencing the same issue here, on a 6.5.0 cluster. As mentioned above, restarting the node that gave the errors fixed the issue for the time being.

Just in case it's useful, according to the Elasticsearch log, we are NOT passing any JVM arg that could be overriding any ttl defaults.

Same here.

It does indeed sound like a DNS caching issue. The manual for 6.5 and earlier says:

Elasticsearch runs with a security manager in place. With a security manager in place, the JVM defaults to caching positive hostname resolutions indefinitely. If your Elasticsearch nodes rely on DNS in an environment where DNS resolutions vary with time (e.g., for node-to-node discovery) then you might want to modify the default JVM behavior. This can be modified by adding networkaddress.cache.ttl=<timeout> to your Java security policy. Any hosts that fail to resolve will be logged. Note also that with the Java security manager in place, the JVM defaults to caching negative hostname resolutions for ten seconds. This can be modified by adding networkaddress.cache.negative.ttl=<timeout> to your Java security policy.

This was a deliberate decision, made some time ago, but we have reflected on its consequences more recently and changed our mind so that in 6.6 by default we now do not cache DNS resolutions forever:

Elasticsearch runs with a security manager in place. With a security manager in place, the JVM defaults to caching positive hostname resolutions indefinitely and defaults to caching negative hostname resolutions for ten seconds. Elasticsearch overrides this behavior with default values to cache positive lookups for sixty seconds, and to cache negative lookups for ten seconds. These values should be suitable for most environments, including environments where DNS resolutions vary with time. If not, you can edit the values es.networkaddress.cache.ttl and es.networkaddress.cache.negative.ttl in the JVM options. Note that the values networkaddress.cache.ttl=<timeout> and networkaddress.cache.negative.ttl=<timeout> in theJava security policy are ignored by Elasticsearch unless you remove the settings for es.networkaddress.cache.ttl and es.networkaddress.cache.negative.ttl .

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.