Searchable Snapshot/Cold Nodes much slower to recover in 8.5+?

Hi All,

I was wondering if anyone else has noticed that, since upgrading to 8.5.x (and 8.6.x), the recovery of Searchable Snapshot/Cold nodes after a rolling restart is much slower than on earlier versions?

A cold node with a few hundred shards used to take around 15 minutes to recover. Since upgrading to 8.5.x and 8.6.x, the same cold nodes now take 30 minutes to recover.
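For reference, an easy way to watch these recoveries in flight is something like the standard cat recovery API filtered to active recoveries (the column list here is just a readable subset):

GET _cat/recovery?v=true&active_only=true&h=index,shard,time,type,stage,source_node,target_node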

Note: the searchable snapshot shards do get allocated, as shown by the cluster allocation explain output below (the request used is sketched after the output):

{
  "note": "No shard was specified in the explain API request, so this response explains a randomly chosen unassigned shard. There may be other unassigned shards in this cluster which cannot be assigned for different reasons. It may not be possible to assign this shard until one of the other shards is assigned correctly. To explain the allocation of other shards (whether assigned or unassigned) you must specify the target shard in the request to this API.",
  "index": "restored-.ds-logs-generic.default-2022.03.20-000002",
  "shard": 0,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "NODE_RESTARTING",
    "at": "2023-02-24T16:38:47.251Z",
    "details": "node_left [p7CoIvggReyPwBnTmeEvug]",
    "last_allocation_status": "no"
  },
  "can_allocate": "yes",
  "allocate_explanation": "Elasticsearch can allocate the shard.",
  "target_node": {
    "id": "p7CoIvggReyPwBnTmeEvug",
    "name": "es-prod-es-rack1-data-cold-0",
    "transport_address": "10.42.3.233:9300",
    "attributes": {
      "k8s_node_name": "k8s02-es",
      "xpack.installed": "true",
      "zone": "rack1"
    }
  },
  "node_allocation_decisions": [
    {
      "node_id": "p7CoIvggReyPwBnTmeEvug",
      "node_name": "es-prod-es-rack1-data-cold-0",
      "transport_address": "10.42.3.233:9300",
      "node_attributes": {
        "k8s_node_name": "k8s02-es",
        "xpack.installed": "true",
        "zone": "rack1"
      },
      "node_decision": "yes",
      "store": {
        "matching_size_in_bytes": 37970143264
      }
    },
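
For completeness, the explain output above came from a plain allocation explain call with no request body, which is why it picked a randomly chosen unassigned shard. To target a specific shard you can pass the index/shard/primary explicitly, e.g.:

GET _cluster/allocation/explain
{
  "index": "restored-.ds-logs-generic.default-2022.03.20-000002",
  "shard": 0,
  "primary": true
}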

It's just that they take far longer to actually recover now.

Would anyone have any ideas/insight into this issue?

Some background: the cluster is running on Kubernetes and is fully managed by ECK, which does the actual work of the rolling restart.
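If it helps with reproducing this: the unassigned reason in the explain output is NODE_RESTARTING, which I believe comes from ECK registering a restart-type node shutdown for each node as it cycles through them. The registered shutdowns can be inspected while the restart is in progress with the standard node shutdown API:

GET _nodes/shutdown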

Hiya @BenB196

Seems like a good case for a support ticket :slight_smile:

