Snapshot restore hanging on 6.4


(Mark Varley) #1

I have a clean install of 6.4.2 (OSS), I have create a new index (single shard), load data into it (~800mb), then force a merge to optimize and finally snapshot to S3. When I try to restore the index into a different cluster (same ES version 6.4.2 OSS) it just hangs indefinitely, no information or errors in the logs. As an experiment I tried to delete the index on the original cluster and restore from S3 and exactly the same behaviour, it just hangs and never restores.

With nothing to go on in the logs and after several hours trawling the forums I am completely stumped. We have been using Elasticsearch in production for years and rely heavily in S3 snapshot/restores and I have never come across this before. Any advice would be hugely appreciated.

ES Version: 6.4.2
AWS Instance: i3.xlarge
Storage: NVMe (Instance)
JVM: OpenJDK 1.8.0_181

/data/mycompany-etl.log

[2018-10-26T05:11:35,076][INFO ][o.e.n.Node               ] [mycompany-etl] initializing ...
[2018-10-26T05:11:35,137][INFO ][o.e.e.NodeEnvironment    ] [mycompany-etl] using [1] data paths, mounts [[/data (/dev/nvme0n1)]], net usable_space [825.5gb], net total_space [869.8gb], types [ext4]
[2018-10-26T05:11:35,137][INFO ][o.e.e.NodeEnvironment    ] [mycompany-etl] heap size [14.9gb], compressed ordinary object pointers [true]
[2018-10-26T05:11:35,138][INFO ][o.e.n.Node               ] [mycompany-etl] node name [mycompany-etl], node ID [Qt2wq6C2TXalnocWgfp2Uw]
[2018-10-26T05:11:35,138][INFO ][o.e.n.Node               ] [mycompany-etl] version[6.4.2], pid[2593], build[oss/deb/04711c2/2018-09-26T13:34:09.098244Z], OS[Linux/4.15.0-1023-aws/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/1.8.0_181/25.181-b13]
[2018-10-26T05:11:35,138][INFO ][o.e.n.Node               ] [mycompany-etl] JVM arguments [-Xms1g, -Xmx1g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.io.tmpdir=/tmp/elasticsearch.VV6nHeo5, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=/var/lib/elasticsearch, -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log, -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, -XX:+PrintTenuringDistribution, -XX:+PrintGCApplicationStoppedTime, -Xloggc:/var/log/elasticsearch/gc.log, -XX:+UseGCLogFileRotation, -XX:NumberOfGCLogFiles=32, -XX:GCLogFileSize=64m, -Xms15g, -Xmx15g, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/etc/elasticsearch, -Des.distribution.flavor=oss, -Des.distribution.type=deb]
[2018-10-26T05:11:36,391][INFO ][o.e.p.PluginsService     ] [mycompany-etl] loaded plugin [repository-s3]
[2018-10-26T05:11:40,100][INFO ][o.e.d.DiscoveryModule    ] [mycompany-etl] using discovery type [zen]
[2018-10-26T05:11:40,601][INFO ][o.e.n.Node               ] [mycompany-etl] initialized
[2018-10-26T05:11:40,601][INFO ][o.e.n.Node               ] [mycompany-etl] starting ...
[2018-10-26T05:11:40,852][INFO ][o.e.t.TransportService   ] [mycompany-etl] publish_address {127.0.0.1:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}
[2018-10-26T05:11:43,914][INFO ][o.e.c.s.MasterService    ] [mycompany-etl] zen-disco-elected-as-master ([0] nodes joined)[, ], reason: new_master {mycompany-etl}{Qt2wq6C2TXalnocWgfp2Uw}{JsiT2G5-TTGoqUQdEC7SlA}{127.0.0.1}{127.0.0.1:9300}
[2018-10-26T05:11:43,919][INFO ][o.e.c.s.ClusterApplierService] [mycompany-etl] new_master {mycompany-etl}{Qt2wq6C2TXalnocWgfp2Uw}{JsiT2G5-TTGoqUQdEC7SlA}{127.0.0.1}{127.0.0.1:9300}, reason: apply cluster state (from master [master {mycompany-etl}{Qt2wq6C2TXalnocWgfp2Uw}{JsiT2G5-TTGoqUQdEC7SlA}{127.0.0.1}{127.0.0.1:9300} committed version [1] source [zen-disco-elected-as-master ([0] nodes joined)[, ]]])
[2018-10-26T05:11:43,959][INFO ][o.e.h.n.Netty4HttpServerTransport] [mycompany-etl] publish_address {127.0.0.1:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}
[2018-10-26T05:11:43,959][INFO ][o.e.n.Node               ] [mycompany-etl] started
[2018-10-26T05:11:43,973][INFO ][o.e.g.GatewayService     ] [mycompany-etl] recovered [0] indices into cluster_state
[2018-10-26T05:13:02,218][INFO ][o.e.c.m.MetaDataCreateIndexService] [mycompany-etl] [mat_abpris_20181026] creating index, cause [auto(bulk api)], templates [], shards [5]/[1], mappings []
[2018-10-26T05:13:02,610][INFO ][o.e.c.m.MetaDataMappingService] [mycompany-etl] [mat_abpris_20181026/GsjlWFkuT3yrCD2QwEcn7w] create_mapping [_doc]
[2018-10-26T05:13:02,624][INFO ][o.e.c.m.MetaDataMappingService] [mycompany-etl] [mat_abpris_20181026/GsjlWFkuT3yrCD2QwEcn7w] update_mapping [_doc]
[2018-10-26T05:17:08,191][INFO ][o.e.c.m.MetaDataDeleteIndexService] [mycompany-etl] [mat_abpris_20181026/GsjlWFkuT3yrCD2QwEcn7w] deleting index
[2018-10-26T05:24:35,337][INFO ][o.e.c.m.MetaDataUpdateSettingsService] [mycompany-etl] updating number_of_replicas to [0] for indices [mat_abpris_20181026]
[2018-10-26T05:24:35,357][INFO ][o.e.c.s.IndexScopedSettings] [mycompany-etl] updating [index.refresh_interval] from [-1] to [60s]
[2018-10-26T05:25:15,928][INFO ][o.e.s.SnapshotsService   ] [mycompany-etl] snapshot [snapshot_mycompany:mat_abpris_20181026/unJsQwyMQeyYlyKZVm6Hrw] started
[2018-10-26T05:25:38,072][INFO ][o.e.s.SnapshotsService   ] [mycompany-etl] snapshot [snapshot_mycompany:mat_abpris_20181026/unJsQwyMQeyYlyKZVm6Hrw] completed with state [SUCCESS]

GET /_cluster/allocation/explain

{
   "index": "mat_abpris_20181026",
   "shard": 0,
   "primary": true,
   "current_state": "initializing",
   "unassigned_info": {
      "reason": "NEW_INDEX_RESTORED",
      "at": "2018-10-26T05:42:14.615Z",
      "details": "restore_source[snapshot_mycompany/mat_abpris_20181026]",
      "last_allocation_status": "awaiting_info"
   },
   "current_node": {
      "id": "Qt2wq6C2TXalnocWgfp2Uw",
      "name": "mycompany-etl",
      "transport_address": "127.0.0.1:9300"
   },
   "explanation": "the shard is in the process of initializing on node [mycompany-etl], wait until initialization has completed"
}

/_cat/recovery?v

index               shard time type     stage source_host source_node target_host target_node      repository            snapshot            files files_recovered files_percent files_total bytes     bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
mat_abpris_20181026 0     3.5m snapshot index n/a         n/a         127.0.0.1   mycompany-etl snapshot_mycompany mat_abpris_20181026 15    9               60.0%         15          645607266 116736429       18.1%         645607266   0            0                      100.0%

(Xavier Facq) #2

Hi

An idea: does your new cluster (with only 1 node) has or is a data-node ?

bye,
XAvier


(Mark Varley) #3

Thanks for the reply Xavier. Yes, I have a single node cluster, the node is both master and data. The node has the same name as the cluster. I wonder if that would cause an issue? Nothing in the logs though to suggest that is a problem. It's really odd.

If anyone has suggestions of some way of getting more insight into the problem that would be great. Have set ES log level to debug but very little information on the restore process.

Thanks,

Mark


(Christian Dahlqvist) #4

How much disk space do you have left? What command/settings are you using when restoring the snapshot?


(Mark Varley) #5

Plenty, it's an 870gb disk and only 7% used. Does ES write any temp files? The root volume is quite small but seems to have a reasonable amount of space:

Filesystem     1K-blocks     Used Available Use% Mounted on
udev            15683608        0  15683608   0% /dev
tmpfs            3139128      788   3138340   1% /run
/dev/xvda1       8065444  2378864   5670196  30% /
tmpfs           15695636        0  15695636   0% /dev/shm
tmpfs               5120        0      5120   0% /run/lock
tmpfs           15695636        0  15695636   0% /sys/fs/cgroup
/dev/loop0         13056    13056         0 100% /snap/amazon-ssm-agent/495
/dev/loop1         89984    89984         0 100% /snap/core/5548
/dev/loop2         90112    90112         0 100% /snap/core/5328
/dev/loop3         16768    16768         0 100% /snap/amazon-ssm-agent/734
/dev/loop4         89984    89984         0 100% /snap/core/5662
/dev/loop5         16896    16896         0 100% /snap/amazon-ssm-agent/784
/dev/nvme0n1   912122860 20426172 845293588   3% /data
tmpfs            3139124        0   3139124   0% /run/user/1000

(Mark Varley) #6

Sorry missed the second part of your question:

POST /_snapshot/snapshot_mycompany/mat_abpris_20181026/_restore?wait_for_completion=true

(Mark Varley) #7

After a day of battling with this issue which included the following....

  • Swapping out OpenJDK 8 for Oracle Java 8 (no difference)
  • Swapping S3 for snapshot for FS snapshot on local disk (works perfectly but not suitable for my use case of snapshotting and index on an ETL cluster and restoring to Staging and Prod)
  • Experimenting with S3 snapshot params, mainly chunk_size and buffer_size (no difference)
  • Trying different heap configurations on ES (no difference)

... I finally threw in the towel and tried a new OS, launching exactly the same server spec but with AWS Linux 2 rather than Ubuntu 18.04 and everything works perfectly. Build script is the same (just swapped out apt for yum), ES version the same, reverted to OpenJDK and all working. Frustrating but great to be back up and running and thank you for your suggestions.


(system) #8

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.