Index copy via snapshot & restore not working as expected

Hi all,

Running ES version 1.3.6. I'm trying to copy a very small index (my
kibana-int index). As far as I can tell, in 1.3.6 the recommended way to
do this is to snapshot and restore to a different name.

The snapshot itself goes great and runs in 8 seconds. Done.
The restore is what's killing me. The index is very small. When I do the
restore, it puts the cluster into a red state (as expected, per the
fourth-to-last paragraph of the documentation here:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/1.3/modules-snapshots.html).
The problem is that it remains in a red state all night and significantly
slows down all other cluster operations. I'll show the commands being run
below; hopefully you guys will notice that I'm doing something wrong.

First, we get stats on the size of our index:

curl -XGET 'http://DataNode1:9200/kibana-int/_status?pretty'

summarized output:
{
  "_shards" : {
    "total" : 36,
    "successful" : 36,
    "failed" : 0
  },
  "_all" : {
    "primaries" : {
      "docs" : {
        "count" : 30,
        "deleted" : 1
      },
      "store" : {
        "size_in_bytes" : 270961,
        "throttle_time_in_millis" : 0
      ...}

So we're only dealing with 270961 bytes, i.e. roughly 264 KiB.
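(A trivial sanity check on that conversion, just taking the
`size_in_bytes` value from the `_status` output above:)

```python
# Convert the store size reported by _status to KiB.
size_in_bytes = 270961  # "size_in_bytes" from the _status output above
print(size_in_bytes // 1024)  # -> 264
```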

Create repository:

curl -XPUT 'http://DataNode1:9200/_snapshot/kibana-int-backups?pretty' -d '{
  "type": "fs",
  "settings": {
    "location": "/data/disk1/elasticsearch/backups",
    "compress": true
  }
}'

output:
{
  "acknowledged" : true
}

Next we verify that the repo was created properly:

curl -XGET 'http://DataNode1:9200/_snapshot/kibana-int-backups?pretty'

output:
{
  "kibana-int-backups" : {
    "type" : "fs",
    "settings" : {
      "compress" : "true",
      "location" : "/data/disk1/elasticsearch/backups"
    }
  }
}

Next, we create the snapshot of our kibana-int index:

curl -XPUT 'DataNode1:9200/_snapshot/kibana-int-backups/snapshot_1?wait_for_completion=true&pretty' -d '{
  "indices": "kibana-int",
  "ignore_unavailable": "false",
  "include_global_state": "false"
}'

output:

{
  "snapshot" : {
    "snapshot" : "snapshot_1",
    "indices" : [ "kibana-int" ],
    "state" : "SUCCESS",
    "start_time" : "2015-01-22T15:10:20.062Z",
    "start_time_in_millis" : 1421939420062,
    "end_time" : "2015-01-22T15:10:27.218Z",
    "end_time_in_millis" : 1421939427218,
    "duration_in_millis" : 7156,
    "failures" : [ ],
    "shards" : {
      "total" : 12,
      "failed" : 0,
      "successful" : 12
    }
  }
}

Next, we restore the snapshot, copying kibana-int to another name:

curl -XPOST 'DataNode1:9200/_snapshot/kibana-int-backups/snapshot_1/_restore?wait_for_completion=true&pretty' -d '{
  "indices": "kibana-int",
  "ignore_unavailable": "false",
  "include_global_state": "false",
  "rename_pattern": "kibana-int",
  "rename_replacement": "copy-kibana-int"
}'
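(For anyone unfamiliar with the rename options: the rename_pattern /
rename_replacement pair behaves like a regex find-and-replace applied to
each restored index name. Conceptually, in Python terms:)

```python
import re

# rename_pattern is the regex, rename_replacement is the substitution,
# applied to the name of each index being restored.
renamed = re.sub("kibana-int", "copy-kibana-int", "kibana-int")
print(renamed)  # -> copy-kibana-int
```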

As expected, the cluster state goes immediately to red:

curl -XGET 'DataNode1:9200/_cluster/health?pretty'

output:

{
  "cluster_name" : "elasticsearch_cluster1",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 14,
  "number_of_data_nodes" : 13,
  "active_primary_shards" : 5723,
  "active_shards" : 17169,
  "relocating_shards" : 0,
  "initializing_shards" : 6,
  "unassigned_shards" : 12
}

I've let it run for over 12 hours and the cluster has never returned to a
green state!
So we start troubleshooting using the recovery and cat APIs:

curl -XGET 'DataNode1:9200/copy-kibana-int/_recovery?pretty' | jq
'.["copy-kibana-int"]|.shards|.[]|.stage'

output:

"DONE"
"DONE"
"DONE"
"DONE"
"DONE"
"DONE"
"DONE"
"DONE"
"DONE"
"DONE"
"DONE"
"DONE"
"DONE"
"DONE"
"DONE"
"DONE"
"DONE"
"DONE"

So, as you can see, all the shards are in the DONE state.
While this is happening, all other queries in the cluster slowly grind to
a halt, and as I said, after 12 hours this index has never fully
replicated. All copy-kibana-int shards are listed as unassigned.
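(For what it's worth, the same stage tally can be pulled out in Python
instead of jq; a sketch, assuming the `_recovery` response is shaped like
the output above, with an abbreviated three-shard stand-in:)

```python
import json
from collections import Counter

# Abbreviated stand-in for the _recovery response shown above.
recovery_response = json.loads("""
{
  "copy-kibana-int": {
    "shards": [
      {"id": 2, "stage": "DONE"},
      {"id": 4, "stage": "DONE"},
      {"id": 5, "stage": "DONE"}
    ]
  }
}
""")

# Equivalent of: jq '.["copy-kibana-int"]|.shards|.[]|.stage'
stages = [shard["stage"] for shard in recovery_response["copy-kibana-int"]["shards"]]
print(Counter(stages))  # -> Counter({'DONE': 3}) for this sample
```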

Lastly, I checked the cat recovery API:

curl -XGET 'DataNode1:9200/_cat/recovery?v'

copy-kibana-int 2  337  snapshot done n/a        DataNode3  kibana-int-backups snapshot_1 4  100.0% 9789  100.0%
copy-kibana-int 2  471  replica  done DataNode3  DataNode8  n/a                n/a        4  100.0% 9789  100.0%
copy-kibana-int 2  496  replica  done DataNode3  DataNode9  n/a                n/a        4  100.0% 9789  100.0%
copy-kibana-int 4  513  snapshot done n/a        DataNode6  kibana-int-backups snapshot_1 7  100.0% 19430 100.0%
copy-kibana-int 4  1231 replica  done DataNode6  DataNode7  n/a                n/a        7  100.0% 19430 100.0%
copy-kibana-int 4  535  replica  done DataNode6  DataNode1  n/a                n/a        7  100.0% 19430 100.0%
copy-kibana-int 5  197  snapshot done n/a        DataNode1  kibana-int-backups snapshot_1 4  100.0% 11582 100.0%
copy-kibana-int 5  444  replica  done DataNode1  DataNode19 n/a                n/a        4  100.0% 11582 100.0%
copy-kibana-int 5  391  replica  done DataNode1  DataNode5  n/a                n/a        4  100.0% 11582 100.0%
copy-kibana-int 6  456  replica  done DataNode17 DataNode1  n/a                n/a        13 100.0% 40273 100.0%
copy-kibana-int 6  984  snapshot done n/a        DataNode17 kibana-int-backups snapshot_1 13 100.0% 40273 100.0%
copy-kibana-int 6  1192 replica  done DataNode17 DataNode5  n/a                n/a        13 100.0% 40273 100.0%
copy-kibana-int 8  399  replica  done DataNode19 DataNode12 n/a                n/a        8  100.0% 29734 100.0%
copy-kibana-int 8  641  replica  done DataNode19 DataNode4  n/a                n/a        8  100.0% 29734 100.0%
copy-kibana-int 8  377  snapshot done n/a        DataNode19 kibana-int-backups snapshot_1 8  100.0% 29734 100.0%
copy-kibana-int 10 401  replica  done DataNode8  DataNode3  n/a                n/a        4  100.0% 10189 100.0%
copy-kibana-int 10 265  snapshot done n/a        DataNode8  kibana-int-backups snapshot_1 4  100.0% 10189 100.0%
copy-kibana-int 10 1071 replica  done DataNode8  DataNode20 n/a                n/a        4  100.0% 10189 100.0%

All the replicas are listed as 100%.
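(To double-check that programmatically, here is a sketch that parses the
cat output, assuming the 1.3 column order of index, shard, time, type,
stage, source_host, target_host, repository, snapshot, files,
files_percent, bytes, bytes_percent, using an abbreviated two-row sample:)

```python
# Abbreviated two-row sample of the _cat/recovery output above.
cat_recovery = """\
copy-kibana-int 2 337 snapshot done n/a DataNode3 kibana-int-backups snapshot_1 4 100.0% 9789 100.0%
copy-kibana-int 2 471 replica done DataNode3 DataNode8 n/a n/a 4 100.0% 9789 100.0%
"""

# Assumed column order for _cat/recovery?v in 1.3.
columns = ["index", "shard", "time", "type", "stage", "source_host",
           "target_host", "repository", "snapshot", "files",
           "files_percent", "bytes", "bytes_percent"]

rows = [dict(zip(columns, line.split())) for line in cat_recovery.splitlines()]

# Confirm every recovery is done with 100% of files and bytes recovered.
all_complete = all(
    r["stage"] == "done"
    and r["files_percent"] == "100.0%"
    and r["bytes_percent"] == "100.0%"
    for r in rows
)
print(all_complete)  # -> True for this sample
```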
Any guidance here?
Apologies for the really long email.
Thanks,
Z
