Peer Recovery does not respect indices.recovery.use_snapshots=false or use_for_peer_recovery=false - misleading documentation? (ES 8.13.4)

doitMLU · October 22, 2025, 11:17am

Hi everyone,

We recently had a big spike in S3 transfers, and subsequently in costs, which led me to troubleshoot the configuration of our Elastic Stack. I found the cluster setting indices.recovery.use_snapshots, which defaults to true, and set it to false. I also set the snapshot repository setting use_for_peer_recovery: false.

According to the documentation (8.13, latest), this should lead to peer recoveries copying shard data from one node to the other node, instead of using the repository for peer recovery (“When indices.recovery.use_snapshots is false Elasticsearch will construct this new copy by transferring the index data from the current primary.”). It also states that “If none of the registered repositories have this setting defined, index files will be recovered from the source node”.

Looking at the actual recoveries done after these changes reveals that they are still downloading shard data from the snapshot repository instead of copying it from the source node. (Change was done 2025-10-21 around 14:00, examples below are from 2025-10-22)

These peer recoveries were all done to rebalance the cluster.

Am I missing something crucial here, or is this setting not working?

More info regarding our setup:

I know that this is an old version of ES, but I have not found any info in GitHub or elsewhere that would suggest changes to these settings happened.

The ILM policy creates searchable snapshots after hot rollover; cold is configured with searchable snapshots for 30-60 days after rollover; deletion happens between 90d and 1y.

//GET /_cluster/settings
{
  "persistent": {
    "action": {
      "auto_create_index": ".ent-search-*-logs-*,-.ent-search-*,+*"
    },
    "cluster": {
      "routing": {
        "allocation": {
          "node_concurrent_incoming_recoveries": "4",
          "disk": {
            "watermark": {
              "low": {
                "max_headroom": "185GB"
              },
              "flood_stage": {
                "max_headroom": "70GB"
              },
              "high": {
                "max_headroom": "120GB"
              }
            }
          },
          "node_initial_primaries_recoveries": "6",
          "balance": {
            "threshold": "50"
          },
          "node_concurrent_outgoing_recoveries": "4",
          "cluster_concurrent_rebalance": "4",
          "node_concurrent_recoveries": "4"
        }
      },
      "max_shards_per_node": "13500"
    },
    "indices": {
      "recovery": {
        "use_snapshots": "false"
      }
    },
    "search": {
      "max_async_search_response_size": "50mb"
    }
  }
}

//GET _snapshot/searchable_snapshot
{
  "searchable_snapshot": {
    "type": "s3",
    "uuid": "<redacted>",
    "settings": {
      "bucket": "<redacted>",
      "region": "eu-central-1",
      "use_for_peer_recovery": "false"
    }
  }
}

//GET _recovery?active_only=false&detailed=false&human=true
"restored-.ds-logs-network_traffic.icmp-otc2_audit_prod-2024.11.08-000008": {
    "shards": [
      {
        "id": 0,
        "type": "PEER",
        "stage": "DONE",
        "primary": true,
        "start_time": "2025-10-22T06:18:53.516Z",
        "start_time_in_millis": 1761113933516,
        "stop_time": "2025-10-22T06:19:03.282Z",
        "stop_time_in_millis": 1761113943282,
        "total_time": "9.7s",
        "total_time_in_millis": 9765,
        "source": {
          "id": "IyJ55ugtTQafWBo1gx0-pg",
          "host": "172.29.12.153",
          "transport_address": "172.29.12.153:9300",
          "ip": "172.29.12.153",
          "name": "elk12-data"
        },
        "target": {
          "id": "aobm-H99R42lWExkxhFzpA",
          "host": "172.29.12.148",
          "transport_address": "172.29.12.148:9300",
          "ip": "172.29.12.148",
          "name": "elk7-data"
        },
        "index": {
          "size": {
            "total": "359.1mb",
            "total_in_bytes": 376577524,
            "reused": "824b",
            "reused_in_bytes": 824,
            "recovered": "359.1mb",
            "recovered_in_bytes": 376576700,
            "recovered_from_snapshot": "359.1mb",
            "recovered_from_snapshot_in_bytes": 376576292,
            "percent": "100.0%"
          },
          "files": {
            "total": 5,
            "reused": 2,
            "recovered": 3,
            "percent": "100.0%"
          },
          "total_time": "9.7s",
          "total_time_in_millis": 9765,
          "source_throttle_time": "0s",
          "source_throttle_time_in_millis": 0,
          "target_throttle_time": "-1",
          "target_throttle_time_in_millis": 0
        },
        "translog": {
          "recovered": 0,
          "total": 0,
          "percent": "100.0%",
          "total_on_start": 0,
          "total_time": "0s",
          "total_time_in_millis": 0
        },
        "verify_index": {
          "check_index_time": "0s",
          "check_index_time_in_millis": 0,
          "total_time": "0s",
          "total_time_in_millis": 0
        }
      }
    ]
  },
"restored-corp_ad_winevent-008127": {
    "shards": [
      {
        "id": 0,
        "type": "PEER",
        "stage": "DONE",
        "primary": true,
        "start_time": "2025-10-22T04:54:05.017Z",
        "start_time_in_millis": 1761108845017,
        "stop_time": "2025-10-22T05:48:11.759Z",
        "stop_time_in_millis": 1761112091759,
        "total_time": "54.1m",
        "total_time_in_millis": 3246741,
        "source": {
          "id": "aobm-H99R42lWExkxhFzpA",
          "host": "172.29.12.148",
          "transport_address": "172.29.12.148:9300",
          "ip": "172.29.12.148",
          "name": "elk7-data"
        },
        "target": {
          "id": "2IfXZEt5T_-3Cdf5at9YDQ",
          "host": "172.29.12.150",
          "transport_address": "172.29.12.150:9300",
          "ip": "172.29.12.150",
          "name": "elk9-data"
        },
        "index": {
          "size": {
            "total": "61gb",
            "total_in_bytes": 65583714433,
            "reused": "1kb",
            "reused_in_bytes": 1109,
            "recovered": "61gb",
            "recovered_in_bytes": 65583713324,
            "recovered_from_snapshot": "61gb",
            "recovered_from_snapshot_in_bytes": 65583712912,
            "percent": "100.0%"
          },
          "files": {
            "total": 22,
            "reused": 2,
            "recovered": 20,
            "percent": "100.0%"
          },
          "total_time": "54.1m",
          "total_time_in_millis": 3246741,
          "source_throttle_time": "0s",
          "source_throttle_time_in_millis": 0,
          "target_throttle_time": "-1",
          "target_throttle_time_in_millis": 0
        },
        "translog": {
          "recovered": 0,
          "total": 0,
          "percent": "100.0%",
          "total_on_start": 0,
          "total_time": "0s",
          "total_time_in_millis": 0
        },
        "verify_index": {
          "check_index_time": "0s",
          "check_index_time_in_millis": 0,
          "total_time": "0s",
          "total_time_in_millis": 0
        }
      }
    ]
  }
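
For anyone who wants to check the same thing on their own cluster, here is a minimal sketch that adds up how many recovered bytes came from the repository per index. It assumes the _recovery response has already been parsed into a Python dict with the same shape as the output above; the function name snapshot_recovery_bytes and the trimmed sample are only illustrative, not part of any Elasticsearch client.

```python
def snapshot_recovery_bytes(recovery_response):
    """Return {index_name: (bytes_from_snapshot, bytes_recovered)} for PEER recoveries.

    `recovery_response` is assumed to be the parsed JSON body of GET _recovery.
    """
    totals = {}
    for index_name, body in recovery_response.items():
        from_snapshot = 0
        recovered = 0
        for shard in body.get("shards", []):
            if shard.get("type") != "PEER":
                continue  # only peer recoveries are of interest here
            size = shard["index"]["size"]
            from_snapshot += size.get("recovered_from_snapshot_in_bytes", 0)
            recovered += size.get("recovered_in_bytes", 0)
        totals[index_name] = (from_snapshot, recovered)
    return totals


# Trimmed sample: only the fields used above, with the byte counts from the two recoveries shown.
recovery_sample = {
    "restored-.ds-logs-network_traffic.icmp-otc2_audit_prod-2024.11.08-000008": {
        "shards": [{"type": "PEER", "index": {"size": {
            "recovered_in_bytes": 376576700,
            "recovered_from_snapshot_in_bytes": 376576292}}}]
    },
    "restored-corp_ad_winevent-008127": {
        "shards": [{"type": "PEER", "index": {"size": {
            "recovered_in_bytes": 65583713324,
            "recovered_from_snapshot_in_bytes": 65583712912}}}]
    },
}

for name, (from_snapshot, recovered) in snapshot_recovery_bytes(recovery_sample).items():
    share = from_snapshot / recovered if recovered else 0.0
    print(f"{name}: {from_snapshot} of {recovered} recovered bytes came from the repository ({share:.2%})")
```

In both recoveries above, practically all recovered bytes are reported as recovered_from_snapshot.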

DavidTurner · October 22, 2025, 2:34pm

Does this data transfer relate to searchable snapshots? indices.recovery.use_snapshots only applies to regular indices - searchable snapshots are always recovered from the snapshot repository.
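
A rough way to tell which indices fall into the searchable snapshot category is to look at their index settings. The sketch below assumes that mounted searchable snapshot indices report index.store.type as snapshot in the (non-flat) response of GET <index>/_settings; please verify that against your own cluster. The helper name searchable_snapshot_indices and the sample data, including the index name regular-index-example, are hypothetical.

```python
def searchable_snapshot_indices(settings_response):
    """Return the sorted names of indices whose store type is 'snapshot'.

    `settings_response` is assumed to be the parsed JSON body of
    GET <index>/_settings (nested form, i.e. without flat_settings).
    """
    names = []
    for index_name, body in settings_response.items():
        store = body.get("settings", {}).get("index", {}).get("store", {})
        if store.get("type") == "snapshot":
            names.append(index_name)
    return sorted(names)


# Hypothetical, heavily trimmed settings response for illustration only.
settings_sample = {
    "restored-corp_ad_winevent-008127": {
        "settings": {"index": {"store": {"type": "snapshot"}}}
    },
    "regular-index-example": {
        "settings": {"index": {"number_of_shards": "1"}}
    },
}

print(searchable_snapshot_indices(settings_sample))
```

Indices returned by this check would be expected to recover from the repository regardless of indices.recovery.use_snapshots.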

Why is this an issue? Downloading data from S3 is pretty close to free. Unless you’re transferring it across regions, but these docs explain why that’s a very bad idea.

DavidTurner · October 22, 2025, 2:45pm

… or maybe these docs (just above the ones I linked previously):

However, if it’s particularly expensive to retrieve data from a snapshot repository in your environment, searchable snapshots may be more costly than regular indices. Ensure that the cost structure of your operating environment is compatible with searchable snapshots before using them.

doitMLU · October 22, 2025, 3:00pm

Oh I see, thanks for the input. I didn’t realize that these snapshot settings don’t change the behavior of the searchable_snapshots indices.

Also, downloading multiple 100s of TBs was indeed not close to free haha… But I inherited that infrastructure, so it might be time to re-architect a little…

This is an on-prem cluster with S3 in the cloud, so we probably have capacity to let the shard recoveries use node-transfers. We don’t have enough storage capacity for that though, which is why we use the S3 here. Is there some way to make this happen the way I previously thought it would work?

doitMLU · October 22, 2025, 3:32pm

Peer-only recoveries don’t seem to be intended.

After thinking about the architecture a bit, it seems like we’d be better off using hot (without snapshots) and frozen. I will try to find out if that is a good fit for all the different data that’s in there.

Thanks again!

DavidTurner · October 22, 2025, 4:18pm

on-prem cluster with S3 in the cloud

Oh yikes yes that’s going to be very expensive if you are using searchable snapshots. If you download data from S3 to EC2 instances within the same region you only pay $0.0000004 per request, regardless of data size. But if you’re pulling it down to an on-prem cluster that isn’t in EC2 then you will have to pay around $0.09/GB on top of that.
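
To put those two rates side by side, here is a back-of-the-envelope sketch. It uses only the approximate figures quoted above ($0.0000004 per request, around $0.09/GB for transfer out of AWS); actual pricing varies by region and volume tier, so treat the constants as assumptions. The function name, the choice of 10^9 bytes per GB, and the 100 TB example volume are illustrative too.

```python
EGRESS_USD_PER_GB = 0.09      # approximate transfer-out rate quoted above (assumption)
REQUEST_USD = 0.0000004       # approximate per-request rate quoted above (assumption)
BYTES_PER_GB = 10**9          # assumption: decimal gigabytes


def estimate_download_cost(size_bytes, requests=0, pay_egress=True):
    """Estimate the USD cost of downloading `size_bytes` from S3.

    With pay_egress=False only the request charge applies, which models
    downloads to EC2 instances in the same region.
    """
    cost = requests * REQUEST_USD
    if pay_egress:
        cost += size_bytes / BYTES_PER_GB * EGRESS_USD_PER_GB
    return cost


# The larger shard from the recovery output earlier in the thread.
shard_bytes = 65583712912
print(f"one shard, on-prem: ${estimate_download_cost(shard_bytes):.2f}")

# A hypothetical 100 TB of downloads, on-prem vs. same-region EC2 (requests ignored).
hundred_tb = 100 * 10**12
print(f"100 TB, on-prem:     ${estimate_download_cost(hundred_tb):.2f}")
print(f"100 TB, same region: ${estimate_download_cost(hundred_tb, pay_egress=False):.2f}")
```

The request charge is left at zero in the examples because the number of GET requests per recovery is not known from this thread; the per-GB charge is what dominates for an on-prem cluster.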
