Elasticsearch snapshot and restore best practices

Hi All,

I want to know the best way to store Elasticsearch data for recovery.

Currently we have each cluster hosted in Azure, with each cluster backed up daily using Recovery Services Vault (RSV).

However, we wanted more frequent snapshots to give us more flexibility in what we can restore, so we configured hourly backups to Azure Storage.

In the event of a disaster, restoring the entire cluster takes over 24 hours, and the cost of hourly snapshots on Azure Storage (GRS) is very high.

We managed to get hourly snapshots and hourly restores to a second region, but I'm now questioning the snapshot backups to storage because of the growing cost. Even deleting snapshots older than 7 days isn't helping.
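For context, our hourly snapshot setup via the repository-azure plugin looks roughly like this (repository, container, and snapshot names below are placeholders, not our real config):

```
PUT _snapshot/azure_backup
{
  "type": "azure",
  "settings": {
    "container": "es-snapshots",
    "base_path": "cluster-backups"
  }
}

PUT _snapshot/azure_backup/snapshot-2021-09-15-01?wait_for_completion=false

DELETE _snapshot/azure_backup/snapshot-2021-09-08-01
```

The hourly snapshot and the delete-older-than-7-days cleanup are driven by an external scheduler in our case.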

Can anyone suggest how this could be made more cost-effective while also taking less than 24 hours to restore the entire cluster?

Thanks

Welcome to our community! :smiley:

How big is your cluster? The output from the `_cluster/stats?pretty&human` API might be helpful here.
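For reference, that's just:

```
GET _cluster/stats?human&pretty
```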

RSV is a filesystem-level backup. The docs say not to do this:

WARNING: The only reliable and supported way to back up a cluster is by taking a snapshot. You cannot back up an Elasticsearch cluster by making copies of the data directories of its nodes. There are no supported methods to restore any data from a filesystem-level backup. If you try to restore a cluster from such a backup, it may fail with reports of corruption or missing files or other data inconsistencies, or it may appear to have succeeded having silently lost some of your data.

That seems worth investigating further: you should be able to restore many TBs onto a single node in 24h, and it should scale linearly in the number of nodes. How much data are you talking about here?


Thanks for the response!

I think this is the info you're looking for:

"store": {
"size": "88.4gb",

I was after the full output, please.

Please also format your code/logs/config using the </> button, or markdown-style backticks. It makes things easier to read, which helps us help you :slight_smile:

Thanks

{
    "_nodes": {
        "total": 4,
        "successful": 4,
        "failed": 0
    },
    "cluster_name": "cluster",
    "timestamp": 1631673830321,
    "status": "green",
    "indices": {
        "count": 4070,
        "shards": {
            "total": 40700,
            "primaries": 20350,
            "replication": 1.0,
            "index": {
                "shards": {
                    "min": 10,
                    "max": 10,
                    "avg": 10.0
                },
                "primaries": {
                    "min": 5,
                    "max": 5,
                    "avg": 5.0
                },
                "replication": {
                    "min": 1.0,
                    "max": 1.0,
                    "avg": 1.0
                }
            }
        },
        "docs": {
            "count": 99590293,
            "deleted": 731484
        },
        "store": {
            "size": "88.4gb",
            "size_in_bytes": 94975452248
        },
        "fielddata": {
            "memory_size": "8.9mb",
            "memory_size_in_bytes": 9433584,
            "evictions": 0
        },
        "query_cache": {
            "memory_size": "25.5mb",
            "memory_size_in_bytes": 26785800,
            "total_count": 19805451,
            "hit_count": 6563667,
            "miss_count": 13241784,
            "cache_size": 10295,
            "cache_count": 303694,
            "evictions": 293399
        },
        "completion": {
            "size": "0b",
            "size_in_bytes": 0
        },
        "segments": {
            "count": 188194,
            "memory": "894.1mb",
            "memory_in_bytes": 937551115,
            "terms_memory": "618.4mb",
            "terms_memory_in_bytes": 648523110,
            "stored_fields_memory": "72.4mb",
            "stored_fields_memory_in_bytes": 76015168,
            "term_vectors_memory": "1.8mb",
            "term_vectors_memory_in_bytes": 1969944,
            "norms_memory": "21.4mb",
            "norms_memory_in_bytes": 22528768,
            "points_memory": "19.2mb",
            "points_memory_in_bytes": 20196533,
            "doc_values_memory": "160.5mb",
            "doc_values_memory_in_bytes": 168317592,
            "index_writer_memory": "0b",
            "index_writer_memory_in_bytes": 0,
            "version_map_memory": "0b",
            "version_map_memory_in_bytes": 0,
            "fixed_bit_set": "1.1mb",
            "fixed_bit_set_memory_in_bytes": 1243192,
            "max_unsafe_auto_id_timestamp": 1631360195137,
            "file_sizes": {}
        }
    },
    "nodes": {
        "count": {
            "total": 4,
            "data": 3,
            "coordinating_only": 1,
            "master": 3,
            "ingest": 3
        },
        "versions": [
            "6.2.2"
        ],
        "os": {
            "available_processors": 16,
            "allocated_processors": 16,
            "names": [
                {
                    "name": "Linux",
                    "count": 4
                }
            ],
            "mem": {
                "total": "125.6gb",
                "total_in_bytes": 134948945920,
                "free": "3.2gb",
                "free_in_bytes": 3536654336,
                "used": "122.3gb",
                "used_in_bytes": 131412291584,
                "free_percent": 3,
                "used_percent": 97
            }
        },
        "process": {
            "cpu": {
                "percent": 203
            },
            "open_file_descriptors": {
                "min": 259,
                "max": 33230,
                "avg": 24952
            }
        },
        "jvm": {
            "max_uptime": "96.6d",
            "max_uptime_in_millis": 8348291343,
            "versions": [
                {
                    "version": "1.8.0_252",
                    "vm_name": "OpenJDK 64-Bit Server VM",
                    "vm_version": "25.252-b09",
                    "vm_vendor": "Oracle Corporation",
                    "count": 4
                }
            ],
            "mem": {
                "heap_used": "28.6gb",
                "heap_used_in_bytes": 30714431080,
                "heap_max": "63.8gb",
                "heap_max_in_bytes": 68580016128
            },
            "threads": 627
        },
        "fs": {
            "total": "404.1gb",
            "total_in_bytes": 433937915904,
            "free": "302gb",
            "free_in_bytes": 324304998400,
            "available": "282.9gb",
            "available_in_bytes": 303800209408
        },
        "plugins": [
            {
                "name": "repository-azure",
                "version": "6.2.2",
                "description": "The Azure Repository plugin adds support for Azure storage repositories.",
                "classname": "org.elasticsearch.plugin.repository.azure.AzureRepositoryPlugin",
                "extended_plugins": [],
                "has_native_controller": false,
                "requires_keystore": false
            }
        ],
        "network_types": {
            "transport_types": {
                "netty4": 4
            },
            "http_types": {
                "netty4": 4
            }
        }
    }
}

A few things;

  • 6.2 is long past EOL; please upgrade.
  • You have far too many shards for your cluster, which is likely causing the slow snapshot and restore process, and probably other issues with your cluster as well.

You have far too many shards in your cluster. Please read this old blog post for some guidance. Having lots of small shards is very inefficient, as every shard has some overhead, and it can cause both performance and stability issues. If you followed best practices around shard sizing, the data volume you have would fit in fewer than 10 shards.
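If your indices are time-based, one way to stop the problem growing is to lower the primary shard count for new indices with an index template. A sketch (the pattern and counts here are placeholders, tune them to your data):

```
PUT _template/fewer_shards
{
  "index_patterns": ["logs-*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}
```

New indices matching the pattern then get 1 primary plus 1 replica instead of the old default of 5 primaries.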

Thanks for the info. Maybe a stupid question, but what steps can be followed to change the number of shards? And can this be done without causing any downtime?

Try Shrink Index | Elasticsearch Reference [6.2] | Elastic to start, especially if you have time based data with multiple primary shards per index.
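As a rough sketch, shrinking one of your 5-primary indices down to 1 primary looks something like this in 6.2 (index and node names are placeholders; the target shard count must be a factor of the source's):

```
# Move a copy of every shard onto one node and block writes
PUT my-index/_settings
{
  "settings": {
    "index.routing.allocation.require._name": "node-1",
    "index.blocks.write": true
  }
}

# Once relocation finishes, shrink into a new 1-shard index
POST my-index/_shrink/my-index-shrunk
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}
```

Note the source index has to be read-only while it is shrunk, so for actively-written indices this is best done on older indices that are no longer receiving writes.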