Snapshot restore from S3 or FS

Hello Experts,

I am working on an ES upgrade from 1.4.4 to 2.4.0 and want to migrate a single index (250 GB) from cluster A (1.4.4) to cluster B (2.4.0). Earlier I took a snapshot to S3 and tried to restore it on cluster B (2.4.0); it spent 5 days creating the index and mapping and finally failed.
Now I have copied the snapshot from S3 to the local file system and want to restore from there (path.repo: /data/s3data/).

Would this approach be better for the restore?
Cluster details below:
Cluster: 1 master-only node, 2 master+data nodes, 2 data nodes
All master+data and data nodes have 64 GB RAM and 1.29 TB of disk; the master-only node has 32 GB RAM and 660 GB of disk.
Heap: no more than half of the total available memory.

YML config: all defaults except the following:

```
cluster.name: myname
node.name: prodnode1
node.master: true
node.data: true
path.data: /data/elasticsearch
path.logs: /var/log/elasticsearch
path.repo: /data/datas3/elastic-search-backup
bootstrap.memory_lock: true
network.host: hostIP
discovery.zen.ping.unicast.hosts: ["master-only", "master-data", "master-data"]
discovery.zen.minimum_master_nodes: 2
```

Any help appreciated. Thanks in advance.

Regards
Mithlesh

What was the error message?

Note that IMO it's better to first upgrade to 1.7, install the migration plugin and check that the upgrade can be done. Look at Breaking changes in 2.0 | Elasticsearch Guide [2.4] | Elastic.
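For reference, installing the plugin on a 1.x node uses the old plugin CLI; something along these lines, where the release URL is a placeholder to be taken from the elasticsearch-migration README for your exact version:

```
# Sketch only -- the vX.Y release URL is a placeholder; check the
# elasticsearch-migration README for the zip matching your 1.7 version.
./bin/plugin --install elasticsearch-migration \
    --url https://github.com/elastic/elasticsearch-migration/releases/download/vX.Y/elasticsearch-migration-X.Y.zip
```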

My advice, though, would be to create a brand new 5.2 cluster and instead use the reindex-from-remote API to reindex the data sitting in your 1.4 cluster directly into the 5.x cluster.
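To give an idea, a reindex-from-remote call on the 5.x side looks roughly like this; the host and index names are placeholders, and the 1.4 cluster's address must also be whitelisted via reindex.remote.whitelist in the 5.x elasticsearch.yml:

```
# Sketch only -- "old-cluster" and "myindex" are placeholders for your 1.4 host and index
curl -XPOST 'http://localhost:9200/_reindex?pretty' -H 'Content-Type: application/json' -d '{
  "source": {
    "remote": { "host": "http://old-cluster:9200" },
    "index": "myindex"
  },
  "dest": { "index": "myindex" }
}'
```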

Hello David,
Many thanks for your reply. There was no error message in the logs, but the restore was stuck in shard initialization for 4 days and nothing happened. After that we restarted the cluster and deleted the index.

The reason may be that S3 had an issue while we were restoring our data; see the news below.

So far we have re-indexed the other indices up to the QA and stage environments using a Logstash script plus snapshot and restore, but we are facing a problem with one index that has more than 245 GB of data.
We are planning to restore from the file system instead of S3. We have added path.repo to the config and restarted the cluster; shard allocation has now been in progress for the last 5 hours.

Is this approach fine for the restore? Why is shard initialization taking so long? Is there any way to monitor background activities other than /_nodes/hot_threads or /_cat/shards?

It's fine to manually copy all the S3 content locally to a shared FS mounted on every node and do the restore from there.
There are some settings which throttle the restore operation so it does not overload your cluster during normal operations.
Maybe you can speed things up a bit.
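As a sketch, restoring from the local copy comes down to registering an fs repository that points at the copied files (the location must appear under path.repo on every node) and then starting the restore; the repository, snapshot and index names below are placeholders:

```
# Sketch only -- repository name, snapshot name and index name are placeholders
curl -XPUT 'http://localhost:9200/_snapshot/local_fs_backup' -d '{
  "type": "fs",
  "settings": { "location": "/data/datas3/elastic-search-backup" }
}'

# Restore just the one large index from the copied snapshot
curl -XPOST 'http://localhost:9200/_snapshot/local_fs_backup/snapshot_1/_restore' -d '{
  "indices": "myindex"
}'
```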

IIRC this setting might have an effect: indices.recovery.max_bytes_per_sec; it's 40mb/s by default. See Indices Recovery | Elasticsearch Guide [5.2] | Elastic.
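It is a dynamic setting, so it can be raised through the cluster settings API; for example (the value is only an illustration):

```
# Sketch -- 200mb is an arbitrary example value
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "indices.recovery.max_bytes_per_sec": "200mb" }
}'
```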

Thanks David, we are about to start the restore process using this approach and have already increased bytes_per_sec to 500mb/s. We are only waiting for the cluster health to become green; after 12 hours it is still in the shard initialization state, and we don't know why it's taking so much time.

Can you check your threadpool status?

Especially the snapshot one?

https://www.elastic.co/guide/en/elasticsearch/reference/5.2/modules-threadpool.html#types

And maybe increase its size?
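On a 2.x cluster the snapshot pool can be watched with the _cat thread_pool API; the column names below are from the 2.x docs:

```
# Active, queued and rejected tasks in the snapshot thread pool, per node
curl 'http://localhost:9200/_cat/thread_pool?v&h=host,snapshot.active,snapshot.queue,snapshot.rejected'
```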

Also there is the max_restore_bytes_per_sec setting which you can set for the restore operation: https://www.elastic.co/guide/en/elasticsearch/reference/5.2/modules-snapshots.html#_shared_file_system_repository

It defaults to 40mb as well.
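That setting lives on the repository itself, so it can be bumped by re-registering the fs repository with it set; roughly like this (names and value are illustrative):

```
# Sketch -- raises the per-node restore throttle on the repository (40mb by default)
curl -XPUT 'http://localhost:9200/_snapshot/local_fs_backup' -d '{
  "type": "fs",
  "settings": {
    "location": "/data/datas3/elastic-search-backup",
    "max_restore_bytes_per_sec": "200mb"
  }
}'
```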

Also look at https://www.elastic.co/guide/en/elasticsearch/reference/2.4/recovery.html

indices.recovery.concurrent_streams and indices.recovery.max_bytes_per_sec might help as well.
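concurrent_streams is dynamic on 2.x as well, so it can go through the same cluster settings API; a sketch with an example value:

```
# Sketch -- 6 is just an example value for the number of recovery streams per node
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "indices.recovery.concurrent_streams": 6 }
}'
```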

Hi David,
We have already increased bytes_per_sec to 500mb/s from 40mb and will check the other options as well. Just to update here:
_recovery returns a blank response.
_tasks returns something like the output below.

```
"CXveFzvHS-GMv-h-uf4RVA:2083" : {
"node" : "CXveFzvHS-GMv-h-uf4RVA",
"id" : 2083,
"type" : "netty",
"action" : "internal:discovery/zen/publish",
"start_time_in_millis" : 1488870873680,
"running_time_in_nanos" : 40744147056568
}
```

Can you give the full output please? Also add the ?human parameter so it might be more readable. And finally, please format it using

```
CODE
```

Hi David,

The output is too long to post in full. Please find below the _tasks response from one node; the other nodes have similar responses.

 "oXAqfp5CRiyPrUb5jCfNBg" : {
      "name" : "prd-use1d-pr-ab-zyxw-esrc-0001",
      "transport_address" : "xx.xxx.4.63:9300",
      "host" : "xx.xxx.4.63",
      "ip" : "xx.xxx.4.63:9300",
      "attributes" : {
        "master" : "true"
      },
      "tasks" : {
        "oXAqfp5CRiyPrUb5jCfNBg:5473" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 5473,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488871292421,
          "running_time_in_nanos" : 43192550832834
        },
        "oXAqfp5CRiyPrUb5jCfNBg:53732" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 53732,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488903072243,
          "running_time_in_nanos" : 11412729618015
        },
        "oXAqfp5CRiyPrUb5jCfNBg:4901" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 4901,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488870843135,
          "running_time_in_nanos" : 43641836878675
        },
        "oXAqfp5CRiyPrUb5jCfNBg:6088" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 6088,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488871683504,
          "running_time_in_nanos" : 42801468033617
        },
        "oXAqfp5CRiyPrUb5jCfNBg:4937" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 4937,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488870873686,
          "running_time_in_nanos" : 43611286065514
        },
        "oXAqfp5CRiyPrUb5jCfNBg:5514" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 5514,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488871327519,
          "running_time_in_nanos" : 43157452870427
        },
        "oXAqfp5CRiyPrUb5jCfNBg:54634" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 54634,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488903514048,
          "running_time_in_nanos" : 10970924290685
        },
        "oXAqfp5CRiyPrUb5jCfNBg:50065" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 50065,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488901140786,
          "running_time_in_nanos" : 13344186682663
        },
        "oXAqfp5CRiyPrUb5jCfNBg:54993" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 54993,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488903687834,
          "running_time_in_nanos" : 10797138729233
        },
        "oXAqfp5CRiyPrUb5jCfNBg:5554" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 5554,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488871357685,
          "running_time_in_nanos" : 43127286906949
        },
        "oXAqfp5CRiyPrUb5jCfNBg:70867" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 70867,
          "type" : "netty",
          "action" : "cluster:monitor/tasks/lists[n]",
          "start_time_in_millis" : 1488914484972,
          "running_time_in_nanos" : 173419,
          "parent_task_id" : "HvHT79ERQw-MF8LAF8fqVw:235249"
        },
        "oXAqfp5CRiyPrUb5jCfNBg:4755" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 4755,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488870720761,
          "running_time_in_nanos" : 43764211098258
        },
        "oXAqfp5CRiyPrUb5jCfNBg:4791" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 4791,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488870750764,
          "running_time_in_nanos" : 43734207930556
        },
        "oXAqfp5CRiyPrUb5jCfNBg:70557" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 70557,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488914334091,
          "running_time_in_nanos" : 150881287207
        },
        "oXAqfp5CRiyPrUb5jCfNBg:4831" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 4831,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488870781325,
          "running_time_in_nanos" : 43703647482110
        },
        "oXAqfp5CRiyPrUb5jCfNBg:49247" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 49247,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488900648449,
          "running_time_in_nanos" : 13836522971621
        },
        "oXAqfp5CRiyPrUb5jCfNBg:49919" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 49919,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488901053079,
          "running_time_in_nanos" : 13431893232057
        }
      }
    }
```

Also, some _nodes/hot_threads output from one node:

```
100.3% (501.3ms out of 500ms) cpu usage by thread 'elasticsearch[prd-use1d-xx.xxx-xxxxx-esrc-0002][clusterService#updateTask][T#1]'
     4/10 snapshots sharing following 18 elements
       java.lang.Object.<init>(Object.java:37)
       java.util.zip.Deflater.<init>(Deflater.java:168)
       org.elasticsearch.common.compress.deflate.DeflateCompressor.streamOutput(DeflateCompressor.java:126)
       org.elasticsearch.common.compress.CompressedXContent.<init>(CompressedXContent.java:83)
       org.elasticsearch.index.mapper.DocumentMapper.<init>(DocumentMapper.java:209)
       org.elasticsearch.index.mapper.DocumentMapper.updateFieldType(DocumentMapper.java:385)
       org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:413)
       org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:320)
       org.elasticsearch.indices.cluster.IndicesClusterStateService.processMapping(IndicesClusterStateService.java:406)
       org.elasticsearch.indices.cluster.IndicesClusterStateService.applyMappings(IndicesClusterStateService.java:367)
       org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:175)
       org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:610)
       org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:772)
       org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
       org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
       java.lang.Thread.run(Thread.java:745)
     3/10 snapshots sharing following 25 elements
       java.util.zip.Deflater.deflateBytes(Native Method)
       java.util.zip.Deflater.deflate(Deflater.java:442)
       java.util.zip.DeflaterOutputStream.flush(DeflaterOutputStream.java:275)
       java.io.BufferedOutputStream.flush(BufferedOutputStream.java:141)
       org.elasticsearch.common.io.stream.OutputStreamStreamOutput.flush(OutputStreamStreamOutput.java:47)
       java.io.FilterOutputStream.flush(FilterOutputStream.java:140)
       java.io.FilterOutputStream.close(FilterOutputStream.java:158)
       com.fasterxml.jackson.core.json.UTF8JsonGenerator.close(UTF8JsonGenerator.java:1068)
       org.elasticsearch.common.xcontent.json.JsonXContentGenerator.close(JsonXContentGenerator.java:448)
       org.elasticsearch.common.xcontent.XContentBuilder.close(XContentBuilder.java:1200)
       org.elasticsearch.common.compress.CompressedXContent.<init>(CompressedXContent.java:90)
       org.elasticsearch.index.mapper.DocumentMapper.<init>(DocumentMapper.java:209)
       org.elasticsearch.index.mapper.DocumentMapper.updateFieldType(DocumentMapper.java:385)
       org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:413)
       org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:320)
       org.elasticsearch.indices.cluster.IndicesClusterStateService.processMapping(IndicesClusterStateService.java:406)
       org.elasticsearch.indices.cluster.IndicesClusterStateService.applyMappings(IndicesClusterStateService.java:367)
       org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:175)
       org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:610)
       org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:772)
       org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
       org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
       java.lang.Thread.run(Thread.java:745)
```

If you run the same hot_threads call multiple times, are you still seeing the same thread usage?

Looking at your logs, it does not seem that the restore operation is really running or actually doing anything.
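One way to confirm whether the restore is actually moving any data is the recovery API rather than the task list; for instance:

```
# Per-shard recovery/restore progress: stage, bytes and files recovered, percent done
curl 'http://localhost:9200/_cat/recovery?v'

# More detail for the index being restored ("myindex" is a placeholder)
curl 'http://localhost:9200/myindex/_recovery?human&pretty'
```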

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.