Populate data from production to a test server with fewer nodes

I'm trying to use the backup and restore functionality via the kopf plugin in order to populate the production data to a test server.

The production setup consists of 3 nodes. The test setup is one node.
I'm able to do what I need with one of the two indexes (1 shard, 1 replica).
But when I try to do this with the second index (2 shards, 1 replica), the restore renders the index unusable because of unassigned shards.

In order to set up the test server, I had to make it an exact copy of the production server (3 nodes, etc.), which is a waste of resources. Is what I'm trying to do even possible?

I'm using ES 1.7.5

What do you mean by "unusable"? Is the index in a red state?

You should be able to restore a three-node cluster into a single-node cluster. The replicas will be unassigned, because they won't allocate to the same node as the primary shards. But all the primary shards should allocate and your cluster will be in a "yellow" state.
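You can confirm which state you're in with the cluster health API; a quick sketch (the endpoint is a placeholder, adjust to your cluster):

```shell
# Placeholder endpoint; on Elastic Cloud use your cluster's URL and credentials.
ES=http://localhost:9200

# Shows the overall status (green/yellow/red) plus a per-index breakdown,
# so you can see exactly which index is red.
curl -s "$ES/_cluster/health?level=indices&pretty"
```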

The index is in a red state yes.

Read through this section of the docs and see if anything stands out to you: https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html#_restoring_to_a_different_cluster

In particular:

  • Do you use allocation filtering/awareness on your cluster? Those settings carry over, so your new single-node may not have the right tags configured (causing shards to not allocate)
  • Did you see any exceptions during the snapshot process? The snapshot may have been incomplete, which means the restore will be incomplete
  • Any exceptions or logs when restoring?
  • Do you have enough disk space? Is the allocation decider logging about high/low disk watermarks?
  • Do you have settings like total_shards_per_node enabled? Or allocation entirely disabled?
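A few of those can be checked quickly from the command line; a sketch, with placeholder endpoint and index name:

```shell
# Placeholder endpoint; adjust to your cluster.
ES=http://localhost:9200

# Which shards are unassigned, and for which index?
curl -s "$ES/_cat/shards?v"

# Any cluster-level allocation settings that carried over from the snapshot source?
curl -s "$ES/_cluster/settings?pretty"

# Index-level settings on the restored index; allocation filters and
# total_shards_per_node would show up here. "my_index" is a placeholder.
curl -s "$ES/my_index/_settings?pretty"
```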

First of all, we are using ES as a service on Elastic Cloud.

  1. No such setting is known to us. We use the elastic cloud service and not our own installation. I believe the nature of the service wouldn't play well with a setting like this.
  2. No exceptions during the snapshot.
  3. Logs are not accessible, so I assume there were no exceptions either, since the snapshot operation was successful.
  4. Yes, the disk space is more than enough.
  5. No setting like this was added by us.

Is there a way to exclude the replicas from restoring? Can I omit them during the snapshot phase?

Hm, I'm not sure tbh. Replicas aren't saved when you create a snapshot (since they're just extra copies of the same data, there's no need to back them up). They are rebuilt after the primary shards have been restored to the cluster.
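That said, the restore API does accept an `index_settings` override, so you can restore with zero replicas and add them back later. A sketch (repository, snapshot, and index names are placeholders):

```shell
# Placeholder endpoint; adjust to your cluster.
ES=http://localhost:9200

# Restore with replicas disabled; once the primaries are assigned you can
# bump index.number_of_replicas back up via the index settings API.
BODY='{
  "indices": "my_big_index",
  "index_settings": {
    "index.number_of_replicas": 0
  }
}'

curl -s -XPOST "$ES/_snapshot/my_repo/my_snapshot/_restore" -d "$BODY"
```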

So if your cluster is staying red, there is either a problem with the primary shards you are trying to restore, or some kind of external factor that is preventing all the primaries from restoring onto the new cluster.

Is there a same-named index that you are trying to restore "on top" of?

Were the indices green when you snapshotted them? You could try doing a partial restore (https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html#_partial_restore) to see if that helps.
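A partial restore just means setting `"partial": true` in the restore request; shards that are missing from the snapshot come back as empty. A sketch with placeholder names:

```shell
# Placeholder endpoint; adjust to your cluster.
ES=http://localhost:9200

# "partial": true lets the restore proceed even if some shards in the
# snapshot are incomplete, instead of failing the whole index.
BODY='{
  "indices": "my_big_index",
  "partial": true
}'

curl -s -XPOST "$ES/_snapshot/my_repo/my_snapshot/_restore" -d "$BODY"
```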

Are you restoring to the same version of Elasticsearch that you snapshotted from?

Ok I'm trying it locally with a docker image.
I'm restoring the two indices one by one. First the small one, which has one shard and is relatively small (100 MB). At first the shard was in the INIT state and remained like that for some minutes; after that it got to the START state, and the cluster was yellow and accessible again.

Now I'm doing the same with the big index (1.5 GB). There are two shards, both in the INIT state, and I'm checking the progress via the recovery API. Maybe the problem was that when I was doing it in the Elastic Cloud setup, I didn't give it enough time? If that's the case (I waited for about 15 mins), why did it finish in just a few minutes when I had the exact same setup (number of nodes, etc.)? Is recovery faster when using more nodes?
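For anyone following along, checking recovery progress looks roughly like this (placeholder endpoint and index name):

```shell
# Placeholder endpoint; adjust to your cluster.
ES=http://localhost:9200

# One line per shard: recovery stage, bytes recovered, files recovered.
curl -s "$ES/_cat/recovery?v"

# More detail for a single index; "my_big_index" is a placeholder.
curl -s "$ES/my_big_index/_recovery?pretty"
```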

It will take some time to recover indices, depending on size. Not sure how much time to be honest; it depends on hardware and index size. 15 minutes does sound like a long time for just 1.5 GB, though.

There are a few things that affect recovery speed. More nodes == more disk IO and network resources, so recovery tends to go faster. There are also throttling settings that limit how many concurrent recoveries a single node can be doing (to prevent a lot of recoveries from swamping a single node in production), so having more nodes means more parallel recoveries and you won't hit the throttling.
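If you control the cluster (on Elastic Cloud you may not be able to change these), the relevant knobs can be adjusted with a transient cluster settings update; a sketch with illustrative values:

```shell
# Placeholder endpoint; adjust to your cluster.
ES=http://localhost:9200

# Raise the per-node recovery bandwidth cap and the number of concurrent
# recoveries a node will perform. Values here are examples, not recommendations.
SETTINGS='{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "100mb",
    "cluster.routing.allocation.node_concurrent_recoveries": 4
  }
}'

curl -s -XPUT "$ES/_cluster/settings" -d "$SETTINGS"
```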

Dunno if that's what you ran into. It may have been something else, but I'm still thinking :confused:

Feel free to post some/most of this thread in the Cloud category as well; the engineers will be happy to help!