Snapshot restore from S3 or FS

Hello Experts,

I am working on an ES upgrade from 1.4.4 to 2.4.0 and want to migrate a single index (250 GB) from cluster A (1.4.4) to cluster B (2.4.0). Earlier I took a snapshot to S3 and tried to restore it on cluster B (2.4.0); it spent 5 days creating the index and mapping and finally failed.
Now I have copied the snapshot from S3 to the local file system and want to restore from there (path.repo: /data/s3data/).

Would this approach be better for the restore?
Cluster details below:
Cluster: 1 master-only node, 2 master+data nodes, 2 data nodes
All master+data and data nodes have 64 GB RAM and 1.29 TB of disk; the master-only node has 32 GB RAM and 660 GB of disk.
Heap: no more than half of the total available memory.

YML config: all defaults except the following:

```
cluster.name: myname
node.name: prodnode1
node.master: true
node.data: true
path.data: /data/elasticsearch
path.logs: /var/log/elasticsearch
path.repo: /data/datas3/elastic-search-backup
bootstrap.memory_lock: true
network.host: hostIP
discovery.zen.ping.unicast.hosts: ["master-only", "master-data", "master-data"]
discovery.zen.minimum_master_nodes: 2
```

Any help appreciated. Thanks in advance.

Regards
Mithlesh

What was the error message?

Note that IMO it's better to first upgrade to 1.7, install the migration plugin and check that the upgrade can be done. Look at Breaking changes in 2.0 | Elasticsearch Guide [2.4] | Elastic.
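For reference, installing the plugin on a 1.x node uses the old plugin CLI; something along these lines, where the release URL is a placeholder to be taken from the elasticsearch-migration README for your exact version:

```
# Sketch only -- the vX.Y release URL is a placeholder; check the
# elasticsearch-migration README for the zip matching your 1.7 version.
./bin/plugin --install elasticsearch-migration \
    --url https://github.com/elastic/elasticsearch-migration/releases/download/vX.Y/elasticsearch-migration-X.Y.zip
```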

My advice, though, would be to create a brand new 5.2 cluster and instead use the reindex-from-remote API to reindex the data sitting in your 1.4 cluster directly into the 5.x cluster.
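To give an idea, a reindex-from-remote call on the 5.x side looks roughly like this; the host and index names are placeholders, and the 1.4 cluster's address must also be whitelisted via reindex.remote.whitelist in the 5.x elasticsearch.yml:

```
# Sketch only -- "old-cluster" and "myindex" are placeholders for your 1.4 host and index
curl -XPOST 'http://localhost:9200/_reindex?pretty' -H 'Content-Type: application/json' -d '{
  "source": {
    "remote": { "host": "http://old-cluster:9200" },
    "index": "myindex"
  },
  "dest": { "index": "myindex" }
}'
```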

Hello David,
Many thanks for your reply. There was no error message in the logs, but the restore was stuck in shard initialization for 4 days and nothing happened. After that we restarted the cluster and deleted the index.

The reason may be that S3 had an issue while we were restoring our data; see the news below.

So far we have re-indexed the other indices up to the QA and stage environments using a Logstash script plus snapshot and restore, but we are facing a problem with one index that has more than 245 GB of data.
We are planning to restore from the file system instead of S3. We have added path.repo to the config and restarted the cluster; shard allocation has now been in progress for the last 5 hours.

Is this approach fine for the restore? Why is shard initialization taking so long? Is there any way to monitor background activities other than /_nodes/hot_threads or /_cat/shards?

It's fine to manually copy all the S3 content locally to a shared FS mounted on every node and do the restore from there.
There are some settings which throttle the restore operation so it does not overload your cluster during normal operations.
Maybe you can speed things up a bit.
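As a sketch, restoring from the local copy comes down to registering an fs repository that points at the copied files (the location must appear under path.repo on every node) and then starting the restore; the repository, snapshot and index names below are placeholders:

```
# Sketch only -- repository name, snapshot name and index name are placeholders
curl -XPUT 'http://localhost:9200/_snapshot/local_fs_backup' -d '{
  "type": "fs",
  "settings": { "location": "/data/datas3/elastic-search-backup" }
}'

# Restore just the one large index from the copied snapshot
curl -XPOST 'http://localhost:9200/_snapshot/local_fs_backup/snapshot_1/_restore' -d '{
  "indices": "myindex"
}'
```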

IIRC this setting might have an effect: indices.recovery.max_bytes_per_sec; it's 40mb/s by default. See Indices Recovery | Elasticsearch Guide [5.2] | Elastic.
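It is a dynamic setting, so it can be raised through the cluster settings API; for example (the value is only an illustration):

```
# Sketch -- 200mb is an arbitrary example value
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "indices.recovery.max_bytes_per_sec": "200mb" }
}'
```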

Thanks David, we are about to start the restore process using this approach and have already increased bytes_per_sec to 500mb/s. We are only waiting for the cluster health to become green; after 12 hours it is still in the shard initialization state, and we don't know why it's taking so much time.

Can you check your threadpool status?

Especially the snapshot one?

https://www.elastic.co/guide/en/elasticsearch/reference/5.2/modules-threadpool.html#types

And maybe increase its size?
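On a 2.x cluster the snapshot pool can be watched with the _cat thread_pool API; the column names below are from the 2.x docs:

```
# Active, queued and rejected tasks in the snapshot thread pool, per node
curl 'http://localhost:9200/_cat/thread_pool?v&h=host,snapshot.active,snapshot.queue,snapshot.rejected'
```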

Also there is the max_restore_bytes_per_sec setting which you can set for the restore operation: https://www.elastic.co/guide/en/elasticsearch/reference/5.2/modules-snapshots.html#_shared_file_system_repository

It defaults to 40mb as well.
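That setting lives on the repository itself, so it can be bumped by re-registering the fs repository with it set; roughly like this (names and value are illustrative):

```
# Sketch -- raises the per-node restore throttle on the repository (40mb by default)
curl -XPUT 'http://localhost:9200/_snapshot/local_fs_backup' -d '{
  "type": "fs",
  "settings": {
    "location": "/data/datas3/elastic-search-backup",
    "max_restore_bytes_per_sec": "200mb"
  }
}'
```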

Also look at https://www.elastic.co/guide/en/elasticsearch/reference/2.4/recovery.html

indices.recovery.concurrent_streams and indices.recovery.max_bytes_per_sec might help as well.
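concurrent_streams is dynamic on 2.x as well, so it can go through the same cluster settings API; a sketch with an example value:

```
# Sketch -- 6 is just an example value for the number of recovery streams per node
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "indices.recovery.concurrent_streams": 6 }
}'
```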

Hi David,
We have already increased bytes_per_sec to 500mb/s from 40mb and will check the other options as well. Just to update here:
_recovery returns a blank response.
_tasks returns something like the output below.

```
"CXveFzvHS-GMv-h-uf4RVA:2083" : {
"node" : "CXveFzvHS-GMv-h-uf4RVA",
"id" : 2083,
"type" : "netty",
"action" : "internal:discovery/zen/publish",
"start_time_in_millis" : 1488870873680,
"running_time_in_nanos" : 40744147056568
}
```

Can you give the full output please? Also add the ?human parameter so it might be more readable. And finally, please format it using

```
CODE
```

Hi David,

The output is too long to post in full. Please find below the _tasks response from one node; the other nodes have similar responses.

 "oXAqfp5CRiyPrUb5jCfNBg" : {
      "name" : "prd-use1d-pr-ab-zyxw-esrc-0001",
      "transport_address" : "xx.xxx.4.63:9300",
      "host" : "xx.xxx.4.63",
      "ip" : "xx.xxx.4.63:9300",
      "attributes" : {
        "master" : "true"
      },
      "tasks" : {
        "oXAqfp5CRiyPrUb5jCfNBg:5473" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 5473,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488871292421,
          "running_time_in_nanos" : 43192550832834
        },
        "oXAqfp5CRiyPrUb5jCfNBg:53732" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 53732,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488903072243,
          "running_time_in_nanos" : 11412729618015
        },
        "oXAqfp5CRiyPrUb5jCfNBg:4901" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 4901,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488870843135,
          "running_time_in_nanos" : 43641836878675
        },
        "oXAqfp5CRiyPrUb5jCfNBg:6088" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 6088,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488871683504,
          "running_time_in_nanos" : 42801468033617
        },
        "oXAqfp5CRiyPrUb5jCfNBg:4937" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 4937,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488870873686,
          "running_time_in_nanos" : 43611286065514
        },
        "oXAqfp5CRiyPrUb5jCfNBg:5514" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 5514,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488871327519,
          "running_time_in_nanos" : 43157452870427
        },
        "oXAqfp5CRiyPrUb5jCfNBg:54634" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 54634,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488903514048,
          "running_time_in_nanos" : 10970924290685
        },
        "oXAqfp5CRiyPrUb5jCfNBg:50065" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 50065,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488901140786,
          "running_time_in_nanos" : 13344186682663
        },
        "oXAqfp5CRiyPrUb5jCfNBg:54993" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 54993,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488903687834,
          "running_time_in_nanos" : 10797138729233
        },
        "oXAqfp5CRiyPrUb5jCfNBg:5554" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 5554,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488871357685,
          "running_time_in_nanos" : 43127286906949
        },
        "oXAqfp5CRiyPrUb5jCfNBg:70867" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 70867,
          "type" : "netty",
          "action" : "cluster:monitor/tasks/lists[n]",
          "start_time_in_millis" : 1488914484972,
          "running_time_in_nanos" : 173419,
          "parent_task_id" : "HvHT79ERQw-MF8LAF8fqVw:235249"
        },
        "oXAqfp5CRiyPrUb5jCfNBg:4755" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 4755,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488870720761,
          "running_time_in_nanos" : 43764211098258
        },
        "oXAqfp5CRiyPrUb5jCfNBg:4791" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 4791,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488870750764,
          "running_time_in_nanos" : 43734207930556
        },
        "oXAqfp5CRiyPrUb5jCfNBg:70557" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 70557,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488914334091,
          "running_time_in_nanos" : 150881287207
        },
        "oXAqfp5CRiyPrUb5jCfNBg:4831" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 4831,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488870781325,
          "running_time_in_nanos" : 43703647482110
        },
        "oXAqfp5CRiyPrUb5jCfNBg:49247" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 49247,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488900648449,
          "running_time_in_nanos" : 13836522971621
        },
        "oXAqfp5CRiyPrUb5jCfNBg:49919" : {
          "node" : "oXAqfp5CRiyPrUb5jCfNBg",
          "id" : 49919,
          "type" : "netty",
          "action" : "internal:discovery/zen/publish",
          "start_time_in_millis" : 1488901053079,
          "running_time_in_nanos" : 13431893232057
        }
      }
    }
```

Also, some _nodes/hot_threads output from one node:

```
100.3% (501.3ms out of 500ms) cpu usage by thread 'elasticsearch[prd-use1d-xx.xxx-xxxxx-esrc-0002][clusterService#updateTask][T#1]'
     4/10 snapshots sharing following 18 elements
       java.lang.Object.<init>(Object.java:37)
       java.util.zip.Deflater.<init>(Deflater.java:168)
       org.elasticsearch.common.compress.deflate.DeflateCompressor.streamOutput(DeflateCompressor.java:126)
       org.elasticsearch.common.compress.CompressedXContent.<init>(CompressedXContent.java:83)
       org.elasticsearch.index.mapper.DocumentMapper.<init>(DocumentMapper.java:209)
       org.elasticsearch.index.mapper.DocumentMapper.updateFieldType(DocumentMapper.java:385)
       org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:413)
       org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:320)
       org.elasticsearch.indices.cluster.IndicesClusterStateService.processMapping(IndicesClusterStateService.java:406)
       org.elasticsearch.indices.cluster.IndicesClusterStateService.applyMappings(IndicesClusterStateService.java:367)
       org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:175)
       org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:610)
       org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:772)
       org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
       org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
       java.lang.Thread.run(Thread.java:745)
     3/10 snapshots sharing following 25 elements
       java.util.zip.Deflater.deflateBytes(Native Method)
       java.util.zip.Deflater.deflate(Deflater.java:442)
       java.util.zip.DeflaterOutputStream.flush(DeflaterOutputStream.java:275)
       java.io.BufferedOutputStream.flush(BufferedOutputStream.java:141)
       org.elasticsearch.common.io.stream.OutputStreamStreamOutput.flush(OutputStreamStreamOutput.java:47)
       java.io.FilterOutputStream.flush(FilterOutputStream.java:140)
       java.io.FilterOutputStream.close(FilterOutputStream.java:158)
       com.fasterxml.jackson.core.json.UTF8JsonGenerator.close(UTF8JsonGenerator.java:1068)
       org.elasticsearch.common.xcontent.json.JsonXContentGenerator.close(JsonXContentGenerator.java:448)
       org.elasticsearch.common.xcontent.XContentBuilder.close(XContentBuilder.java:1200)
       org.elasticsearch.common.compress.CompressedXContent.<init>(CompressedXContent.java:90)
       org.elasticsearch.index.mapper.DocumentMapper.<init>(DocumentMapper.java:209)
       org.elasticsearch.index.mapper.DocumentMapper.updateFieldType(DocumentMapper.java:385)
       org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:413)
       org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:320)
       org.elasticsearch.indices.cluster.IndicesClusterStateService.processMapping(IndicesClusterStateService.java:406)
       org.elasticsearch.indices.cluster.IndicesClusterStateService.applyMappings(IndicesClusterStateService.java:367)
       org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:175)
       org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:610)
       org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:772)
       org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
       org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
       java.lang.Thread.run(Thread.java:745)
```

If you run the same hot_threads call multiple times, are you still seeing the same thread usage?

Looking at your logs, it does not seem that the restore operation is really running or actually doing anything.
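One way to confirm whether the restore is actually moving any data is the recovery API rather than the task list; for instance:

```
# Per-shard recovery/restore progress: stage, bytes and files recovered, percent done
curl 'http://localhost:9200/_cat/recovery?v'

# More detail for the index being restored ("myindex" is a placeholder)
curl 'http://localhost:9200/myindex/_recovery?human&pretty'
```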

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.