Index restoration from snapshot is taking a long time in Elasticsearch 7.2.0

Elasticsearch version (bin/elasticsearch --version): 7.2.0

Plugins installed:

JVM version (java -version): JDK 12 (bundled)

OS version (uname -a if on a Unix-like system): Ubuntu 14.04.5 LTS

This is in reference to https://github.com/elastic/elasticsearch/issues/55013

Description of the problem including expected versus actual behaviour:

We recently migrated to Elasticsearch 7.2.0 and set up a backup cron that takes a snapshot of production every 2 hours and restores it onto a backup cluster. The restore, however, takes around 2-3 hours to complete, whereas in 1.7 it used to finish within a minute.
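Roughly, the cron does something like the following (a sketch; the hosts are placeholders and the snapshot naming scheme is inferred from the snapshot names shown later in this post):

# on production: take a timestamped snapshot and wait for it to finish
curl -XPUT "http://prod-host:9200/_snapshot/repo/snapshot_$(date +%y%m%d%H%M%S)?wait_for_completion=true"

# on the backup cluster: restore that snapshot
curl -XPOST "http://backup-host:9200/_snapshot/repo/snapshot_<timestamp>/_restore"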

During the restore, what we observed is that cat/indices doesn't list all the indices. At first we wondered whether the restore was deleting indices, which would be bad, but the disk utilization graphs suggested that wasn't quite the case. However, there was a significant drop in used disk space (roughly half, as shown here):

[Screenshot: disk usage graph, 2020-04-09 18:35]

The restore started at 17:09, and as you can see, disk usage dropped by almost half, from ~300GB to ~150GB.

cat/shards shows all the primaries in the INITIALIZING state.
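For reference, this was checked with something like the following (a sketch; the column selection is just illustrative):

curl "http://host:port/_cat/shards/issues-index?v&h=index,shard,prirep,state,node"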

Sample lines from cat/recovery:

issues-index 0  4m snapshot index n/a n/a x.x.x.x es7advcl01-09 es7advcl02 snapshot_200409120001 313 9  2.9% 319 36982790918 2508329227 6.8%  37255110711 0 0 100.0%
issues-index 1  4m snapshot index n/a n/a x.x.x.x es7advcl01-13 es7advcl02 snapshot_200409120001 331 9  2.7% 356 35062077787 2528382202 7.2%  37816732630 0 0 100.0%
issues-index 2  4m snapshot index n/a n/a x.x.x.x es7advcl01-11 es7advcl02 snapshot_200409120001 277 1  0.4% 308 31645896495 2528871414 8.0%  37250260213 0 0 100.0%
issues-index 3  4m snapshot index n/a n/a x.x.x.x es7advcl01-07 es7advcl02 snapshot_200409120001 278 9  3.2% 278 37237367179 2505543480 6.7%  37237367179 0 0 100.0%

Output of the allocation explain API for a primary shard.
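The request was roughly the following (a sketch; the index and shard number match the response below):

curl -XGET "http://host:port/_cluster/allocation/explain" \
  -H 'Content-Type: application/json' \
  -d '{"index": "issues-index", "shard": 0, "primary": true}'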

{
  "index": "issues-index",
  "shard": 0,
  "primary": true,
  "current_state": "initializing",
  "unassigned_info": {
    "reason": "EXISTING_INDEX_RESTORED",
    "at": "2020-04-09T13:20:41.435Z",
    "details": "restore_source[es7advcl02/snapshot_200409120001]",
    "last_allocation_status": "awaiting_info"
  },
  "current_node": {
    "id": "0ccKssgEQmayndGpJA0Kfg",
    "name": "es7advcl01-09",
    "transport_address": "x.x.x.x:xxxx",
    "attributes": {
      "nodetag": "issues",
      "ml.machine_memory": "64388841472",
      "ml.max_open_jobs": "20",
      "xpack.installed": "true"
    }
  },
  "explanation": "the shard is in the process of initializing on node [es7advcl01-09], wait until initialization has completed"
}

Sample index/_recovery response (three shards shown).
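It was fetched roughly like this (a sketch; the full response covers all shards):

curl "http://host:port/issues-index/_recovery"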

{
  "issues-index": {
    "shards": [
      {
        "id": 16,
        "type": "SNAPSHOT",
        "stage": "INDEX",
        "primary": true,
        "start_time_in_millis": 1586438441641,
        "total_time_in_millis": 2599243,
        "source": {
          "repository": "es7advcl02",
          "snapshot": "snapshot_200409120001",
          "version": "7.2.0",
          "index": "issues-index",
          "restoreUUID": "6cdxLeuWQmaqRNkhXWupKw"
        },
        "index": {
          "size": {
            "total_in_bytes": 36833533836,
            "reused_in_bytes": 393555386,
            "recovered_in_bytes": 27281694426,
            "percent": "74.9%"
          },
          "files": {
            "total": 337,
            "reused": 6,
            "recovered": 108,
            "percent": "32.6%"
          },
          "total_time_in_millis": 2599240,
          "source_throttle_time_in_millis": 0,
          "target_throttle_time_in_millis": 0
        },
        "translog": {
          "recovered": 0,
          "total": 0,
          "percent": "100.0%",
          "total_on_start": 0,
          "total_time_in_millis": 0
        },
        "verify_index": {
          "check_index_time_in_millis": 0,
          "total_time_in_millis": 0
        }
      },
      {
        "id": 24,
        "type": "SNAPSHOT",
        "stage": "INDEX",
        "primary": true,
        "start_time_in_millis": 1586438441644,
        "total_time_in_millis": 2599241,
        "source": {
          "repository": "es7advcl02",
          "snapshot": "snapshot_200409120001",
          "version": "7.2.0",
          "index": "issues-index",
          "restoreUUID": "6cdxLeuWQmaqRNkhXWupKw"
        },
        "index": {
          "size": {
            "total_in_bytes": 37394586004,
            "reused_in_bytes": 384374051,
            "recovered_in_bytes": 27237551760,
            "percent": "73.6%"
          },
          "files": {
            "total": 337,
            "reused": 6,
            "recovered": 117,
            "percent": "35.3%"
          },
          "total_time_in_millis": 2599238,
          "source_throttle_time_in_millis": 0,
          "target_throttle_time_in_millis": 0
        },
        "translog": {
          "recovered": 0,
          "total": 0,
          "percent": "100.0%",
          "total_on_start": 0,
          "total_time_in_millis": 0
        },
        "verify_index": {
          "check_index_time_in_millis": 0,
          "total_time_in_millis": 0
        }
      },
      {
        "id": 8,
        "type": "SNAPSHOT",
        "stage": "INDEX",
        "primary": true,
        "start_time_in_millis": 1586438441638,
        "total_time_in_millis": 2599247,
        "source": {
          "repository": "es7advcl02",
          "snapshot": "snapshot_200409120001",
          "version": "7.2.0",
          "index": "issues-index",
          "restoreUUID": "6cdxLeuWQmaqRNkhXWupKw"
        },
        "index": {
          "size": {
            "total_in_bytes": 37646982630,
            "reused_in_bytes": 0,
            "recovered_in_bytes": 27241548911,
            "percent": "72.4%"
          },
          "files": {
            "total": 326,
            "reused": 0,
            "recovered": 119,
            "percent": "36.5%"
          },
          "total_time_in_millis": 2599244,
          "source_throttle_time_in_millis": 0,
          "target_throttle_time_in_millis": 0
        },
        "translog": {
          "recovered": 0,
          "total": 0,
          "percent": "100.0%",
          "total_on_start": 0,
          "total_time_in_millis": 0
        },
        "verify_index": {
          "check_index_time_in_millis": 0,
          "total_time_in_millis": 0
        }
      }
    ]
  }
}

index.{files,size}.reused seems to be very low for every shard.
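As a quick check across all shards, the reused versus total bytes can be summed from the _recovery output (a sketch; it assumes jq is installed and the response was saved to recovery.json):

# sum reused vs total bytes over all shards of the index
jq '[.["issues-index"].shards[].index.size]
    | {reused_bytes: (map(.reused_in_bytes) | add),
       total_bytes: (map(.total_in_bytes) | add)}' recovery.json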

Here is what happened while snapshots were being taken every 2 hours:

  1. We had the index set up on production.
  2. A snapshot was taken of this index with some data in it.
  3. We found a bug in the mapping.
  4. We deleted the index and recreated it with the same name.
  5. All this time snapshots kept happening.
  6. We indexed 2TB of data as part of our migration. Settings such as refresh_interval and number_of_replicas=0 were applied and then reverted to the defaults once the migration was done, and the forcemerge API was called once after indexing completed (sketched after this list).
  7. Snapshots kept happening.
  8. Restores started to slow down. I understand the very first restore can be slow, but I expect subsequent restores to be fast because they are incremental. We see slowness in subsequent restores as well, which is undesirable for a DR strategy.
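For step 6, the settings changes during the migration were along these lines (a sketch; the exact values here are illustrative):

# before the bulk indexing: disable refresh and replicas on the index
curl -XPUT "http://host:port/issues-index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"refresh_interval": "-1", "number_of_replicas": 0}}'

# after the migration: revert to the defaults, then force merge once
curl -XPUT "http://host:port/issues-index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"refresh_interval": null, "number_of_replicas": 1}}'
curl -XPOST "http://host:port/issues-index/_forcemerge"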

We restore using:

curl -XPOST "host:port/_snapshot/repo/snapshot-name/_restore"

We also tried wiping the entire backup cluster (removed the data folder as well and did a fresh restore), but that didn't help either.

We also tried:

curl -XPOST "host:port/_snapshot/repo/snapshot-name/_restore" \
  -H 'Content-Type: application/json' \
  -d '{ "include_global_state": true }'

We make use of node attributes on the production cluster; here are the relevant index settings:

{
  "issues-index": {
    "settings": {
      "index": {
        "routing": {
          "allocation": {
            "include": {
              "nodetag": "issues"
            }
          }
        }
      }
    }
  }
}

For the remaining indices, nodetag has the value nonissues. Both backup and production have the same set of attributes, so this is not a concern.

I also just now confirmed that, during restore, ES does not move primary shards to other nodes. This more or less rules out rebalancing caused by the nodetag attribute above.

This sounds like a very strange way to operate a cluster. Snapshots copy segments incrementally, so by force-merging your indices before snapshotting them you are losing all incrementality for both the snapshot and the subsequent restore. You should also not force-merge indices if you will continue writing to them.

If the RTO offered by restoring a snapshot is a problem then you would do well to use cross-cluster replication instead, since this lets you run a hot DR cluster that typically lags behind your production cluster by a very small amount and is always ready for promotion.
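For reference, a follower index under CCR is set up roughly like this (a hedged sketch; the remote cluster alias, hosts and follower index name are placeholders, and CCR requires the appropriate license level):

# assumes a remote cluster alias "production" has already been configured on the DR cluster
curl -XPUT "http://dr-host:9200/issues-index-copy/_ccr/follow" \
  -H 'Content-Type: application/json' \
  -d '{"remote_cluster": "production", "leader_index": "issues-index"}'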


@DavidTurner No, no, I force-merged only once, after the initial indexing. Sorry if that wasn't clear. We haven't done any force merge since then, as we know it doesn't play well with snapshot/restore. We do have a lot of update traffic, though. It's strange that 1.x restored faster. Maybe the segment merging algorithm has changed in a way that makes incremental snapshots less helpful?

As of now we don't have any plans to buy a Platinum license, so cross-cluster replication might be out of the picture.

As per the force merge docs:

Force merge should only be called against an index after you have finished writing to it.

You are saying that you force-merged after the initial indexing and then carried on writing to the index, which goes against the documented recommendations.

That's the regular write traffic of our system (we can't stop that, and we might as well have skipped the force merge; we'll keep that in mind).

Anyway, I'm guessing that would impact the initial snapshot/restore, but subsequent restores should have been faster. In 1.7 it took merely a minute to restore, whereas in 7.2 it takes around 2 hours just to restore the primary shards (I'm not talking about the first restore when the backup is empty; that one will be slow and that's fine).

From the stats above, it is hardly reusing any files when restoring, @DavidTurner.
Should we try creating a new repository and doing a fresh snapshot and restore? I'm guessing something got messed up in the previous snapshots.

Hm, OK, I failed to understand from your OP that you were seeing a 100x increase in recovery time. That's surprising. Could you set logger.org.elasticsearch.repositories.blobstore.FileRestoreContext: TRACE, then attempt two consecutive restores, the second of which you expect to be incremental on the first, and share the resulting logs?
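You can change that logger dynamically without a restart, for example (a sketch using a transient cluster setting):

curl -XPUT "http://host:port/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"logger.org.elasticsearch.repositories.blobstore.FileRestoreContext": "TRACE"}}'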

Sure, will try setting the logger. Do you want the logs from the master nodes or the data nodes?

All nodes please.

Will provide by tomorrow. Thanks a lot @DavidTurner

Please also share the output of GET <INDEX>/_recovery at the end of both phases of the restore.


@DavidTurner here are the attached log files.

01, 02, 03 are the master nodes; the rest are data nodes. initial-restore.json and incremental-restore.json are the output of issues/_recovery.

The first (full) restore happened around 2020-04-12T07:26:18, and the incremental (second) restore happened around 2020-04-12T09:04:20.

FYI, in the meantime we also tried creating a snapshot in a new repository (a clean slate). That too was slow to restore incrementally.

What do you mean by incremental restore? Is it restoring a different, newer set of indices?

The restore process always restores the complete indices as of the time the snapshot was taken.

@Christian_Dahlqvist I meant the restore of the second snapshot (i.e. a restore had already happened on the backup before this, and there is already 1TB of data in it). The restore of the first snapshot will be the longest; subsequent restores should only pull in the diff. No, it is restoring the same index. AFAIK in 1.7 it took merely seconds to restore a snapshot (except the first one, when the backup is empty), unless things have changed in 7.2.

The index snapshot process is incremental. In the process of making the index snapshot, Elasticsearch analyses the list of the index files that are already stored in the repository and copies only files that were created or changed since the last snapshot. That allows multiple snapshots to be preserved in the repository in a compact form. The snapshotting process is executed in a non-blocking fashion. All indexing and searching operations can continue to be executed against the index that is being snapshotted. However, a snapshot represents the point-in-time view of the index at the moment the snapshot was created, so no records that were added to the index after the snapshot process was started will be present in the snapshot. The snapshot process starts immediately for the primary shards that have been started and are not relocating at the moment.

@DavidTurner maybe you can explain how things have changed in the snapshot/restore logic for the following scenario.

Let's say a production cluster exists and snapshots are being taken every 2 hours. Now we have a backup cluster which is empty.

We do a restore from the latest snapshot taken. What's the expectation here?
Now let's say the backup has been restored from the latest snapshot and a new snapshot is then taken. We restore the backup from this new snapshot (keep in mind the backup already has the indices from the previous snapshot; let's also assume no new indices are created and only 2 hours' worth of data came into the system).

In 1.7, the restore clearly just copied the necessary segment files, keeping the untouched segment files as they were. In 7.2 that doesn't seem to be the case. This is either a bug or a change in logic over time.

Are we expected, for every restore, to recreate the indices from scratch even if the backup has already been restored from an earlier snapshot?

This is all true but irrelevant since it only talks about the snapshot process and says nothing about restores. Restores are in fact incremental in terms of segments and not files. In the logs you shared, it looks like almost every segment is changed (probably by deleting/updating some of the docs it contains) and this is why Elasticsearch is restoring them all again. There are a few unchanged segments which are being skipped as expected, but that's very much the exception. Elasticsearch isn't recreating the indices from scratch as you claim, it's just reacting to the broad distribution of updates/deletes in your index.

This also makes it clear that force-merging and then continuing to write data is adding to your woes since the force-merged segment is large and has a high probability of being changed and therefore requiring a full restore. You really should follow the recommendations in the manual that I mentioned above in this regard.

I'm willing to believe that this area has changed drastically in the 3+ years since 1.7 was released; maybe the old behaviour was buggy in some cases. I'm not especially keen to dig through the history in detail to find the exact point where this changed, but don't let that stop you from investigating further if you want to.


We'll reindex the documents again and this time not forcemerge. Thanks for the help @DavidTurner

@DavidTurner

So we reindexed all our data again and did not force merge (but write traffic is still happening).
I don't see any improvement in restoration either way.

One thing we noticed is that if we took snapshots every 15 minutes, the restore reused a lot of files (as per the _recovery output), but when the snapshot was taken every 2 hours the number of reused files was quite low.

On a side note, I'm curious why indices.recovery.max_bytes_per_sec does not affect the speed at which replicas get initialized from primaries. From graphs, I always see around 40MB/s of data transfer per node even though the setting has been set to 200mb.
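For reference, we apply that limit as a dynamic cluster setting, roughly like this (a sketch, assuming 200mb is the intended value):

curl -XPUT "http://host:port/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"indices.recovery.max_bytes_per_sec": "200mb"}}'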

In that case it sounds like you are updating so many documents that you're touching all your segments, causing them all to need restoring.

FWIW this logic hasn't really changed since version 1.4, although the context in which it's used is of course radically different today. I can see some scope for improvements in this area that weren't applicable in the past and opened https://github.com/elastic/elasticsearch/issues/55142 (see also https://issues.apache.org/jira/browse/LUCENE-9324).

I haven't previously heard of anyone else with similar issues, corroborating what I said earlier about this being a strange way to operate a cluster.

@DavidTurner could you explain why the way we operate our cluster is strange? I'd like to know if we are doing something wrong.

Actually we are migrating from 1.7 to 7.2, which is why I've been talking about the reindexing. We have to move all our documents from 1.7 to 7.2, and the write traffic also needs to switch to 7.2. I agree that ES is primarily not a datastore where you do lots of updates, but that's our application, and ES was necessary for our search capabilities.

Snapshots are not instantaneous; they take time to complete and do not represent perfect point-in-time views of the cluster. While a snapshot is in progress, you can still index documents and make other requests to the cluster.
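Progress of a running snapshot can be checked with the snapshot status API, for example (a sketch reusing the repository and snapshot names from earlier in this thread):

curl "http://host:port/_snapshot/es7advcl02/snapshot_200409120001/_status"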