Fastest copy of index

I know this question may have been asked before, but the responses in other threads are not sufficient for my use case.

Data:
Cluster: 6 nodes
Index size: 10GB
Document count: 35.000.000
Shards: 5

We want to be able to copy a current index to a new index.

We have tried the following options:

  • Reindex API: works, but the copy takes about 40 minutes
  • Snapshot and restore: this used to work well when the index was half its current size (about 2 minutes). Now that the index has grown, it takes between 10 and 15 minutes. What is strange is that 3 or 4 out of 5 shards still recover in about 3-4 minutes, while the remaining 1 or 2 shards take 10 minutes or more. We also verified that this happens regardless of which node is restoring those last shards, so it's not node-related. (A rough sketch of the flow we use is below.)
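
Roughly, the snapshot/restore flow we use looks like this (all names here are placeholders, not our real ones):

PUT _snapshot/azurerepo/copy-snapshot?wait_for_completion=true
{
    "indices": "source-index"
}

POST _snapshot/azurerepo/copy-snapshot/_restore
{
    "indices": "source-index",
    "rename_pattern": "source-index",
    "rename_replacement": "copy-of-source-index"
}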

What other options are there to make an exact copy of an index in the fastest way?
Alternatively, how can I troubleshoot / optimize the snapshot/restore method to get back to the previous timings?

Note that we only need a byte-for-byte copy; there's no need to modify templates, filter data, ...

Hey,

restoring sounds like the best option here, as it comes as close to a byte-for-byte copy as possible. You might want to read a bit through the indices recovery and shard allocation settings. I think the maximum recovery rate is set to 40MB per second by default, which is not a lot for an SSD. The number of shards being recovered in parallel might also be interesting. Do you see how many shards are recovering in parallel?
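
For a quick check, something along these lines shows which recovery settings you have overridden and the per-shard recovery progress (just a sketch):

GET _cluster/settings?flat_settings=true

GET _cat/recovery?v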

And lastly, it would be super useful if you also tell us which Elasticsearch version you are using, in order to rule out potential bugs that you may not be hitting on newer versions.

--Alex

Hi,

Thanks for the info.

I have already set the max recovery rate to "1g", so I suppose that's maxed out already.
I set the parallel shard recoveries to 5. Here's the request body I used before restoring the indices:

{
    "transient" : {
        "indices.recovery.max_bytes_per_sec" : "1g",
        "cluster.routing.allocation.node_concurrent_recoveries" : "5"
    }
}

All shards are about 2GB, so I find it weird that the first few get recovered relatively fast, while the last one or two take a lot of time.

The ES version we're using is 5.5.

Do you see whether those shards are recovered in parallel, using the indices recovery or the cat recovery API?

Yes, they are. I can see all shards being recovered at once. It's just that the first ones finish rather fast, while the last one (sometimes the last two) takes a long time.

I had a look at the recovery endpoint and found this:

This is about 5 minutes into the recovery process. 4 out of 5 shards have been fully recovered. One was still pending at that point and took another 5 minutes to be fully recovered.

I also had a look at the node that was still recovering and saw that CPU usage was somewhat elevated (50%), but not maxed out. Network usage was also relatively low (< 5MBps).

(Note: I removed some properties from the JSON to ensure the body size of the reply is below 7000)
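
The output below comes from a call along these lines (the index name is the placeholder used in the output):

GET /name-of-my-index/_recovery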

{
"name-of-my-index": {
    "shards": [
        {
            "id": 0,
            "type": "SNAPSHOT",
            "stage": "DONE",
            "primary": true,
            "start_time_in_millis": 1522082646114,
            "stop_time_in_millis": 1522082797602,
            "total_time_in_millis": 151490,
            "source": {
                "repository": "azurerepo",
                "version": "5.5.0",
                "index": "source-index"
            },
            "target": {
                "host": "10.0.0.32",
                "transport_address": "10.0.0.32:9300",
                "name": "es-node-32"
            },
            "index": {
                "size": {
                    "total_in_bytes": 2364627154,
                    "reused_in_bytes": 0,
                    "recovered_in_bytes": 2364627154,
                    "percent": "100.0%"
                },
                "files": {
                    "total": 15,
                    "reused": 0,
                    "recovered": 15,
                    "percent": "100.0%"
                },
                "total_time_in_millis": 151397,
                "source_throttle_time_in_millis": 0
            }
        },
        {
            "id": 1,
            "type": "SNAPSHOT",
            "stage": "DONE",
            "primary": true,
            "start_time_in_millis": 1522082646774,
            "stop_time_in_millis": 1522082792022,
            "total_time_in_millis": 145243,
            "source": {
                "repository": "azurerepo",
                "version": "5.5.0",
                "index": "source-index"
            },
            "target": {
                "host": "10.0.0.36",
                "transport_address": "10.0.0.36:9300",
                "name": "es-node-36"
            },
            "index": {
                "size": {
                    "total_in_bytes": 2351813707,
                    "reused_in_bytes": 0,
                    "recovered_in_bytes": 2351813707,
                    "percent": "100.0%"
                },
                "files": {
                    "total": 15,
                    "reused": 0,
                    "recovered": 15,
                    "percent": "100.0%"
                },
                "total_time_in_millis": 145140,
                "source_throttle_time_in_millis": 0,
            }
        },
        {
            "id": 2,
            "type": "SNAPSHOT",
            "stage": "DONE",
            "primary": true,
            "start_time_in_millis": 1522082646808,
            "stop_time_in_millis": 1522082799345,
            "total_time_in_millis": 152540,
            "source": {
                "repository": "azurerepo",
                "version": "5.5.0",
                "index": "source-index"
            },
            "target": {
                "host": "10.0.0.34",
                "transport_address": "10.0.0.34:9300",
                "name": "es-node-34"
            },
            "index": {
                "size": {
                    "total_in_bytes": 2368768858,
                    "reused_in_bytes": 0,
                    "recovered_in_bytes": 2368768858,
                    "percent": "100.0%"
                },
                "files": {
                    "total": 15,
                    "reused": 0,
                    "recovered": 15,
                    "percent": "100.0%"
                },
                "total_time_in_millis": 152432,
                "source_throttle_time_in_millis": 0,
            }
        },
        {
            "id": 3,
            "type": "SNAPSHOT",
            "stage": "DONE",
            "primary": true,
            "start_time_in_millis": 1522082646833,
            "stop_time_in_millis": 1522082792649,
            "total_time_in_millis": 145814,
            "source": {
                "repository": "azurerepo",
                "version": "5.5.0",
                "index": "source-index"
            },
            "target": {
                "host": "10.0.0.31",
                "transport_address": "10.0.0.31:9300",
                "name": "es-node-31"
            },
            "index": {
                "size": {
                    "total_in_bytes": 2354015576
                    "reused_in_bytes": 0
                    "recovered_in_bytes": 2354015576
                    "percent": "100.0%"
                },
                "files": {
                    "total": 15
                    "reused": 0
                    "recovered": 15
                    "percent": "100.0%"
                },
                "total_time_in_millis": 145709
                "source_throttle_time_in_millis": 0
            }
        },
        {
            "id": 4,
            "type": "SNAPSHOT",
            "stage": "INDEX",
            "primary": true,
            "start_time_in_millis": 1522082646155,
            "total_time_in_millis": 448751,
            "source": {
                "repository": "azurerepo",
                "version": "5.5.0",
                "index": "source-index"
            },
            "target": {
                "host": "10.0.0.35",
                "transport_address": "10.0.0.35:9300",
                "name": "es-node-35"
            },
            "index": {
                "size": {
                    "total_in_bytes": 2072413946,
                    "reused_in_bytes": 0,
                    "recovered_in_bytes": 1363323599,
                    "percent": "65.8%"
                },
                "files": {
                    "total": 15,
                    "reused": 0,
                    "recovered": 7,
                    "percent": "46.7%"
                },
                "total_time_in_millis": 448742,
                "source_throttle_time_in_millis": 0
            }
        }
    ]
}
}

I did some more investigation and noticed that when creating the initial snapshot, it exhibits the same behavior: 4 out of 5 shards are very quick to snapshot, but 1 of them takes a very long time:

Output taken from /_snapshot/-repo-/-snapshot-/_status:

"shards": {
	"0": {
		"stage": "DONE",
		"stats": {
			"number_of_files": 15,
			"processed_files": 15,
			"total_size_in_bytes": 2364627154,
			"processed_size_in_bytes": 2364627154,
			"start_time_in_millis": 1522108912771,
			"time_in_millis": 185400
		}
	},
	"1": {
		"stage": "DONE",
		"stats": {
			"number_of_files": 15,
			"processed_files": 15,
			"total_size_in_bytes": 2351813707,
			"processed_size_in_bytes": 2351813707,
			"start_time_in_millis": 1522108887293,
			"time_in_millis": 191809
		}
	},
	"2": {
		"stage": "DONE",
		"stats": {
			"number_of_files": 15,
			"processed_files": 15,
			"total_size_in_bytes": 2368768858,
			"processed_size_in_bytes": 2368768858,
			"start_time_in_millis": 1522108890010,
			"time_in_millis": 793881
		}
	},
	"3": {
		"stage": "DONE",
		"stats": {
			"number_of_files": 15,
			"processed_files": 15,
			"total_size_in_bytes": 2354015576,
			"processed_size_in_bytes": 2354015576,
			"start_time_in_millis": 1522108887076,
			"time_in_millis": 221145
		}
	},
	"4": {
		"stage": "DONE",
		"stats": {
			"number_of_files": 15,
			"processed_files": 15,
			"total_size_in_bytes": 2072413946,
			"processed_size_in_bytes": 2072413946,
			"start_time_in_millis": 1522108887114,
			"time_in_millis": 169030
		}
	}
}

As you can see, the shard with index 2 is taking a very long time to snapshot:

0 => 185400 (3m5s - 2,2GB)
1 => 191809 (3m11s - 2,19GB)
2 => 793881 (13m13s - 2,2GB)
3 => 221145 (3m41s - 2,19GB)
4 => 169030 (2m49s - 1,93GB)

Given that the size of the shards is very similar, what could be the reason that one of the shards takes longer to snapshot / restore?

I assume this is a network-based filesystem somewhere? Did you test whether copying other huge files from that share also loses performance over time?

Also are there any messages in the logs (log of the node doing the recovery or the master node)?

The filesystem under Elasticsearch is a locally attached disk (it's in Azure, so I'm not sure where the actual disk physically is, but it has local SSD performance).

I haven't had any issues copying huge files. In fact, in the case above, I can see 4/5 files being copied really fast, so I don't think the issue is there.

The repository is an Azure blob storage repository, which is reasonably fast as well. We also do other operations on the cluster pulling and writing to the repository without an issue.

I have checked the logs of the nodes, but couldn't find anything in particular. The only thing I did find is that all nodes (not just the ones with the slow restore) have quite a few GC overhead logs (20% of the time is spent GC'ing). I have implemented some measures to reduce the overall shard count, as we had loads of small shards (< 100MB) and I'm consolidating them into fewer, larger shards (by reindexing). However, I don't think that's related to this particular issue.

Given that the same issue occurs in both snapshotting and restoring with a single shard, is it possible there's some data or shard "corruption" going on?

There are so many buffers along the way when copying 15GB that there is always something that can fill up. But I am indeed less worried about the local disk than about the blob storage repository. Did you test copying larger files from there as well?

The GC overhead is pretty big indeed.

When you restore into a dedicated single-node cluster that is not part of this cluster, do you encounter the same issues? (This would help rule out the cluster setup.)
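
Roughly, that would mean installing the repository-azure plugin on the test node, registering the same repository there (ideally read-only) and restoring from it, something like this (container and snapshot names are placeholders):

PUT _snapshot/azurerepo
{
    "type": "azure",
    "settings": {
        "container": "your-existing-container",
        "readonly": true
    }
}

POST _snapshot/azurerepo/your-snapshot/_restore
{
    "indices": "source-index"
}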

I haven't tested restoring into a single-node cluster. I'd have to set that up.

However, we haven't had any issues with the blob storage repo:

  • We have been restoring indices from there for quite a while with no issue
  • We are still restoring other indices from there quite quickly
  • 4/5 shards are restoring really fast.

I don't see an issue with the speed of the connection. If I look at the network load on the slowly restoring node, I see it only has about 1-2MBps going. I can definitely say that the network connection to Azure blob storage is much faster than that.

I could see a potential issue with the network speed if all 5 shards were taking the same amount of time, but given that 4 out of 5 copy very fast, I don't think it's an issue with the repo.

I'm now going to try to rebuild the index from scratch and see whether a restore of that new index shows the same issue.

The rebuild of the index (I had started it this morning) just finished and I created a snapshot. It shows the same behavior. The snapshot covers 5 indices; 4 of them are rather small and consist of a single shard.

The one with the issue is named "index-with-the-issue" in the attached file, and you can see that the shard with index 4 takes about 5x as long to snapshot as the others.

I also did a file transfer just now from the blob storage account and got a throughput of around 50MB/s, so the speed seems to be OK (the file transfer took around 40s).

Link to the log: https://gist.github.com/Kennethtruyers/a48fdc3db0382fa843eb73b823ad9942

I have just executed the same restore (from the same repository) on a single-node cluster.

Cluster
--------
Nodes: 6
CPU: Xeon E5v3 2.4Ghz
RAM: 56GB
Heap size: 28GB
Total indexes/shards before restore: 60/220
Settings:
    {
        "transient": {
            "indices.recovery.max_bytes_per_sec": "1g",
            "cluster.routing.allocation.node_concurrent_recoveries": "5"
        }
    }

Single node
-----------
Nodes: 1
CPU: Core i7 6700 4Ghz
RAM: 64GB
Heap size: 12GB
Total indexes/shards before restore: 0/0
Settings:
    {
        "transient": {
            "indices.recovery.max_bytes_per_sec": "1g",
            "cluster.routing.allocation.node_concurrent_recoveries": "5"
        }
    }

The recovery time for all shards on the single node was around 7 minutes, which is quite a bit faster than what I measured on the cluster.

Checking the network details, I see a download speed of up to 200MBps at the beginning and later around 100MBps (once some of the shards had finished their restore).

My conclusion is that it's not related to the repository, as that was the same in both tests, so it must be something in the cluster configuration. However, I'm still not sure what the issue is or how to debug it.

I have done another test, this time looking a bit more into the network side.

I have noticed the following bandwidth pattern on the node where the restore process is slow:

[bandwidth graph]

As you can see, it looks like it's constantly interrupting the connection. Is there any setting / log I can check to verify this?

On the other nodes, the bandwidth is fairly constant, there are some variations but no vertical drops like on this machine.

I've done a couple more tests and I see the above pattern repeating on every recovery. It seems to happen on a different node every time.

Since that's also the node that takes the extra amount of time to recover, I'm quite certain that this is the culprit.

I've done some network tests and on the server level everything seems to be OK, so this must lie somewhere in the configuration of the cluster (especially since it's usually a different node that has this behavior).

I have enabled trace logging of the transport module, but can't see anything suspicious in the logs.
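
For reference, I enabled that dynamically via the cluster settings API, roughly like this (it could also be set in log4j2.properties):

PUT _cluster/settings
{
    "transient": {
        "logger.org.elasticsearch.transport": "trace"
    }
}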

@spinscale Sorry to bump this, but do you have any insight or pointers where to look for this?

I think it's the restore throttling setting kicking in:

max_restore_bytes_per_sec, which defaults to 40mb per second.

https://www.elastic.co/guide/en/elasticsearch/reference/5.5/modules-snapshots.html

Can you set this to a higher value on your repository?
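
Bumping it means re-registering the repository with the higher value, along these lines (the container name is a placeholder; keep whatever other settings the repository was originally registered with). I included the snapshot-side equivalent as well, since your snapshots showed the same pattern:

PUT _snapshot/azurerepo
{
    "type": "azure",
    "settings": {
        "container": "your-existing-container",
        "max_restore_bytes_per_sec": "1g",
        "max_snapshot_bytes_per_sec": "1g"
    }
}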

I have it set to 1g, as mentioned above:

{
    "transient" : {
        "indices.recovery.max_bytes_per_sec" : "1g",
        "cluster.routing.allocation.node_concurrent_recoveries" : "5"
    }
}

I find it odd that the network profile looks like that; it does indeed look like throttling somewhere, but I can't figure out where. The recovery limit seems high enough that it shouldn't kick in (I'd be happy if we got 1g/s).

Also, why would that only be the case on 1 out of 6 nodes?

The two settings are different... one is the recovery bytes per second, the other is the restore bytes per second.

Oh, good point, I had missed that.

I just bumped that specific setting (to 1g, just to check whether an effectively unbounded value would make a difference). The result is now a bit different, but ultimately still the same.

Now, rather than one node having that network profile, it seems like all nodes have the same issue. The total restore time is now 8 minutes (for around 7GB of data in total). Given that the raw network throughput is much higher than that, I still wonder which part of the system is the bottleneck.