Elasticsearch snapshot and restore best practices

Hi All,

I want to know the best way to store Elasticsearch data for recovery.

Currently we have each cluster hosted in Azure, with each cluster backed up daily using Recovery Services Vault (RSV).

However, we wanted more frequent snapshots to give us more flexibility in what we can restore, so we configured hourly backups to Azure Storage.

In the event of a disaster, restoring the entire cluster takes over 24 hours, and the cost of hourly snapshots on Azure Storage (GRS) is very high.

We managed to get hourly snapshots and hourly restores to a second region, but I'm now questioning the snapshot backups to storage because of the growing cost. Even deleting snapshots older than 7 days isn't helping.
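For context, our hourly snapshot setup via the repository-azure plugin looks roughly like this (repository, container, and snapshot names below are placeholders, not our real config):

```
PUT _snapshot/azure_backup
{
  "type": "azure",
  "settings": {
    "container": "es-snapshots",
    "base_path": "cluster-backups"
  }
}

PUT _snapshot/azure_backup/snapshot-2021-09-15-01?wait_for_completion=false

DELETE _snapshot/azure_backup/snapshot-2021-09-08-01
```

The hourly snapshot and the delete-older-than-7-days cleanup are driven by an external scheduler in our case.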

Can anyone suggest how this could be made more cost-effective while also taking less than 24 hours to restore the entire cluster?

Thanks

Welcome to our community! :smiley:

How big is your cluster? The output from the `_cluster/stats?pretty&human` API might be helpful here.
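For reference, that's just:

```
GET _cluster/stats?human&pretty
```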

RSV is a filesystem-level backup. The docs say not to do this:

WARNING: The only reliable and supported way to back up a cluster is by taking a snapshot. You cannot back up an Elasticsearch cluster by making copies of the data directories of its nodes. There are no supported methods to restore any data from a filesystem-level backup. If you try to restore a cluster from such a backup, it may fail with reports of corruption or missing files or other data inconsistencies, or it may appear to have succeeded having silently lost some of your data.

That seems worth investigating further: you should be able to restore many TBs onto a single node in 24h, and it should scale linearly in the number of nodes. How much data are you talking about here?


Thanks for the response!

I think this is the info you're looking for:

"store": {
"size": "88.4gb",

I was after the full output, please.

Please also format your code/logs/config using the </> button, or markdown-style backticks. It makes things easier to read, which helps us help you :slight_smile:

Thanks

{
    "_nodes": {
        "total": 4,
        "successful": 4,
        "failed": 0
    },
    "cluster_name": "cluster",
    "timestamp": 1631673830321,
    "status": "green",
    "indices": {
        "count": 4070,
        "shards": {
            "total": 40700,
            "primaries": 20350,
            "replication": 1.0,
            "index": {
                "shards": {
                    "min": 10,
                    "max": 10,
                    "avg": 10.0
                },
                "primaries": {
                    "min": 5,
                    "max": 5,
                    "avg": 5.0
                },
                "replication": {
                    "min": 1.0,
                    "max": 1.0,
                    "avg": 1.0
                }
            }
        },
        "docs": {
            "count": 99590293,
            "deleted": 731484
        },
        "store": {
            "size": "88.4gb",
            "size_in_bytes": 94975452248
        },
        "fielddata": {
            "memory_size": "8.9mb",
            "memory_size_in_bytes": 9433584,
            "evictions": 0
        },
        "query_cache": {
            "memory_size": "25.5mb",
            "memory_size_in_bytes": 26785800,
            "total_count": 19805451,
            "hit_count": 6563667,
            "miss_count": 13241784,
            "cache_size": 10295,
            "cache_count": 303694,
            "evictions": 293399
        },
        "completion": {
            "size": "0b",
            "size_in_bytes": 0
        },
        "segments": {
            "count": 188194,
            "memory": "894.1mb",
            "memory_in_bytes": 937551115,
            "terms_memory": "618.4mb",
            "terms_memory_in_bytes": 648523110,
            "stored_fields_memory": "72.4mb",
            "stored_fields_memory_in_bytes": 76015168,
            "term_vectors_memory": "1.8mb",
            "term_vectors_memory_in_bytes": 1969944,
            "norms_memory": "21.4mb",
            "norms_memory_in_bytes": 22528768,
            "points_memory": "19.2mb",
            "points_memory_in_bytes": 20196533,
            "doc_values_memory": "160.5mb",
            "doc_values_memory_in_bytes": 168317592,
            "index_writer_memory": "0b",
            "index_writer_memory_in_bytes": 0,
            "version_map_memory": "0b",
            "version_map_memory_in_bytes": 0,
            "fixed_bit_set": "1.1mb",
            "fixed_bit_set_memory_in_bytes": 1243192,
            "max_unsafe_auto_id_timestamp": 1631360195137,
            "file_sizes": {}
        }
    },
    "nodes": {
        "count": {
            "total": 4,
            "data": 3,
            "coordinating_only": 1,
            "master": 3,
            "ingest": 3
        },
        "versions": [
            "6.2.2"
        ],
        "os": {
            "available_processors": 16,
            "allocated_processors": 16,
            "names": [
                {
                    "name": "Linux",
                    "count": 4
                }
            ],
            "mem": {
                "total": "125.6gb",
                "total_in_bytes": 134948945920,
                "free": "3.2gb",
                "free_in_bytes": 3536654336,
                "used": "122.3gb",
                "used_in_bytes": 131412291584,
                "free_percent": 3,
                "used_percent": 97
            }
        },
        "process": {
            "cpu": {
                "percent": 203
            },
            "open_file_descriptors": {
                "min": 259,
                "max": 33230,
                "avg": 24952
            }
        },
        "jvm": {
            "max_uptime": "96.6d",
            "max_uptime_in_millis": 8348291343,
            "versions": [
                {
                    "version": "1.8.0_252",
                    "vm_name": "OpenJDK 64-Bit Server VM",
                    "vm_version": "25.252-b09",
                    "vm_vendor": "Oracle Corporation",
                    "count": 4
                }
            ],
            "mem": {
                "heap_used": "28.6gb",
                "heap_used_in_bytes": 30714431080,
                "heap_max": "63.8gb",
                "heap_max_in_bytes": 68580016128
            },
            "threads": 627
        },
        "fs": {
            "total": "404.1gb",
            "total_in_bytes": 433937915904,
            "free": "302gb",
            "free_in_bytes": 324304998400,
            "available": "282.9gb",
            "available_in_bytes": 303800209408
        },
        "plugins": [
            {
                "name": "repository-azure",
                "version": "6.2.2",
                "description": "The Azure Repository plugin adds support for Azure storage repositories.",
                "classname": "org.elasticsearch.plugin.repository.azure.AzureRepositoryPlugin",
                "extended_plugins": [],
                "has_native_controller": false,
                "requires_keystore": false
            }
        ],
        "network_types": {
            "transport_types": {
                "netty4": 4
            },
            "http_types": {
                "netty4": 4
            }
        }
    }
}

A few things;

  • 6.2 is long past EOL; please upgrade.
  • You have far too many shards for your cluster, which is likely causing the slow snapshot and restore process, and probably other issues with your cluster as well.

You have far too many shards in your cluster. Please read this old blog post for some guidance. Having lots of small shards is very inefficient, as every shard has some overhead, and it can cause both performance and stability issues. If you followed best practices around shard sizing, the data volume you have would fit in fewer than 10 shards.
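If your indices are time-based, one way to stop the problem growing is to lower the primary shard count for new indices with an index template. A sketch (the pattern and counts here are placeholders, tune them to your data):

```
PUT _template/fewer_shards
{
  "index_patterns": ["logs-*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}
```

New indices matching the pattern then get 1 primary plus 1 replica instead of the old default of 5 primaries.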

Thanks for the info. Maybe a stupid question, but what steps can be followed to change the number of shards? And can this be done without causing any downtime?

Try Shrink Index | Elasticsearch Reference [6.2] | Elastic to start, especially if you have time based data with multiple primary shards per index.
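As a rough sketch, shrinking one of your 5-primary indices down to 1 primary looks something like this in 6.2 (index and node names are placeholders; the target shard count must be a factor of the source's):

```
# Move a copy of every shard onto one node and block writes
PUT my-index/_settings
{
  "settings": {
    "index.routing.allocation.require._name": "node-1",
    "index.blocks.write": true
  }
}

# Once relocation finishes, shrink into a new 1-shard index
POST my-index/_shrink/my-index-shrunk
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}
```

Note the source index has to be read-only while it is shrunk, so for actively-written indices this is best done on older indices that are no longer receiving writes.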