Hot threads shows a lot of snapshot threads sleeping. What does this mean? The FS repository uses all default settings. The thread pool stats also show a lot of snapshot threads active and queued, even though the snapshot script only runs from the master nodes (3). There are 47 data nodes across 13 machines.
2.4% (12.1ms out of 500ms) cpu usage by thread 'elasticsearch[c346syd-data-9203][snapshot][T#3549]'
10/10 snapshots sharing following 20 elements
java.lang.Thread.sleep(Native Method)
java.lang.Thread.sleep(Thread.java:340)
org.apache.lucene.store.RateLimiter$SimpleRateLimiter.pause(RateLimiter.java:153)
org.elasticsearch.index.snapshots.blobstore.RateLimitingInputStream.maybePause(RateLimitingInputStream.java:52)
org.elasticsearch.index.snapshots.blobstore.RateLimitingInputStream.read(RateLimitingInputStream.java:71)
org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext$AbortableInputStream.read(BlobStoreRepository.java:1421)
java.io.FilterInputStream.read(FilterInputStream.java:107)
org.elasticsearch.common.io.Streams.copy(Streams.java:76)
org.elasticsearch.common.blobstore.fs.FsBlobContainer.writeBlob(FsBlobContainer.java:131)
org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshotFile(BlobStoreRepository.java:1354)
org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1296)
org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:893)
org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:377)
org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:87)
org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:333)
org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:527)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
2.4% (12.1ms out of 500ms) cpu usage by thread 'elasticsearch[c346syd-data-9203][snapshot][T#3546]'
10/10 snapshots sharing following 20 elements
java.lang.Thread.sleep(Native Method)
java.lang.Thread.sleep(Thread.java:340)
org.apache.lucene.store.RateLimiter$SimpleRateLimiter.pause(RateLimiter.java:153)
org.elasticsearch.index.snapshots.blobstore.RateLimitingInputStream.maybePause(RateLimitingInputStream.java:52)
I am trying to understand how snapshots work under the hood. The issue is that the snapshot time keeps increasing: it is now at ~3 hours, up from 15 min, and the snapshots tend to produce a large backup too. We only keep the last 3 snapshots in the same repo.
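For what it's worth, the stack above shows each snapshot thread parked in `RateLimiter$SimpleRateLimiter.pause`, reached through `RateLimitingInputStream.maybePause`, so the threads look throttled rather than stuck. On a repository like this the throttle is governed by the `max_snapshot_bytes_per_sec` repository setting, which as far as I know defaults to 40mb per second per node in 5.x. A minimal Python sketch of the idea follows. `SimpleThrottle` and `throttled_copy` are hypothetical names made up for illustration, not the Lucene or Elasticsearch implementation.

```python
import io
import time


class SimpleThrottle:
    """Toy byte-rate limiter (illustrative only, not Lucene's RateLimiter)."""

    def __init__(self, bytes_per_sec, clock=time.monotonic, sleep=time.sleep):
        self.bytes_per_sec = float(bytes_per_sec)
        self._clock = clock
        self._sleep = sleep
        self._last = clock()      # time up to which earlier bytes are "paid for"
        self.total_slept = 0.0

    def pause(self, nbytes):
        """Sleep long enough that nbytes do not exceed the configured rate."""
        now = self._clock()
        target = self._last + nbytes / self.bytes_per_sec
        if target <= now:
            # The copy itself was slower than the limit, so no sleep is needed.
            self._last = now
            return 0.0
        delay = target - now
        self._sleep(delay)        # this is where a hot-threads dump would catch the thread
        self.total_slept += delay
        self._last = target
        return delay


def throttled_copy(src, dst, throttle, chunk_size=8192):
    """Copy src to dst in chunks, pausing after each chunk to respect the limit."""
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return copied
        dst.write(chunk)
        copied += len(chunk)
        throttle.pause(len(chunk))


if __name__ == "__main__":
    # 1 MiB at 4 MiB/s spends roughly a quarter of a second asleep.
    throttle = SimpleThrottle(bytes_per_sec=4 * 1024 * 1024)
    throttled_copy(io.BytesIO(b"\0" * (1024 * 1024)), io.BytesIO(), throttle)
    print(f"slept {throttle.total_slept:.2f}s")
```

When the disk is faster than the limit, most of the wall-clock time goes to `sleep`, which would match the low CPU percentage and the `Thread.sleep` frames in the dump.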
Thanks for the response. We are also seeing the error below on all successfully completed snapshots when I run GET _snapshot/CaymanAuditQEDB/<any_snapshot_name>/_status.
We were seeing the same error while trying to delete old snapshots too. Also, although the snapshots are reported as completed, restoring them turns the cluster red with the error below:
[2019-06-07T13:38:25.674Z][INFO][org.elasticsearch.cluster.routing.allocation.AllocationService][c732uyu-master-9200] Cluster health status changed from [YELLOW] to [RED] (reason: [async_shard_fetch])
Please take your time to properly format your messages with markdown, this makes them 10x easier to read. Thanks!
It seems that a file has been corrupted on the snapshot disk or has not been properly written. Is this happening with all of your snapshots? Also what Elasticsearch version are you using and what type of snapshot repo is this?
Yes, this is happening with all my snapshots. The ES version is 5.1.2 and it's a shared file system repository. During the snapshot process we don't see any errors or warnings, but when we issue a delete snapshot, we see "codec footer mismatch (file truncated?)" in the ES log.
We cannot change the ES version currently, but we will try taking a snapshot to a different filesystem.
A question about the "codec footer mismatch (file truncated?)" warning: I have noticed that all of these warnings come from one specific index and shard. Do you think it could be something wrong with the index? We can search this index just fine, though.
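To make the "codec footer mismatch (file truncated?)" message a bit more concrete, here is a rough Python sketch of the kind of check behind it. It assumes the footer layout as I understand it from Lucene's `CodecUtil`: the last 16 bytes of a file hold a big-endian magic number, an algorithm ID of 0, and a CRC32 of everything before the checksum. Treat that layout, and the helper names `check_footer` and `with_footer`, as assumptions for illustration and not as the exact code path Elasticsearch runs. Blobs that a repository has split into several parts would not each end in a footer.

```python
import struct
import zlib

# Assumed values: bitwise complement of Lucene's CODEC_MAGIC (0x3FD76C17).
FOOTER_MAGIC = 0xC02893E8
FOOTER_LENGTH = 16  # 4-byte magic + 4-byte algorithm ID + 8-byte checksum


def check_footer(data: bytes) -> str:
    """Classify a blob as 'ok', 'too short', 'footer mismatch', or 'checksum mismatch'."""
    if len(data) < FOOTER_LENGTH:
        return "too short"
    magic, algorithm_id, stored = struct.unpack(">IIQ", data[-FOOTER_LENGTH:])
    if magic != FOOTER_MAGIC or algorithm_id != 0:
        # A truncated file no longer ends with the expected magic bytes.
        return "footer mismatch"
    # The checksum covers every byte before the 8-byte checksum field.
    actual = zlib.crc32(data[:-8]) & 0xFFFFFFFF
    return "ok" if actual == stored else "checksum mismatch"


def with_footer(body: bytes) -> bytes:
    """Build a synthetic blob with a footer, for trying out the check only."""
    prefix = body + struct.pack(">II", FOOTER_MAGIC, 0)
    return prefix + struct.pack(">Q", zlib.crc32(prefix) & 0xFFFFFFFF)


if __name__ == "__main__":
    blob = with_footer(b"segment data")
    print(check_footer(blob))        # intact blob
    print(check_footer(blob[:-5]))   # truncated blob
```

A search only reads the live shard copy on the data node, while the status, delete and restore calls read the copies in the repository, so an index that searches fine does not rule out a truncated file on the snapshot filesystem.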